CN117351115A - Training method of image generation model, image generation method, device and equipment - Google Patents

Training method of image generation model, image generation method, device and equipment

Info

Publication number: CN117351115A
Application number: CN202311230141.8A
Authority: CN (China)
Prior art keywords: image, text, objects, combined, generation model
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventor: 郭卉
Current assignee: Tencent Technology Shenzhen Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd; priority claimed to CN202311230141.8A

Classifications

    • G06T 11/00: 2D [Two Dimensional] image generation
    • G06T 11/60: Editing figures and text; combining figures or text
    • G06F 16/55: Information retrieval of still image data; clustering; classification
    • G06F 16/5846: Retrieval of still image data characterised by metadata automatically derived from the content, using extracted text
    • G06F 16/587: Retrieval of still image data characterised by metadata, using geographical or spatial information, e.g. location
    • G06N 3/045: Neural network architectures; combinations of networks
    • G06N 3/0455: Auto-encoder networks; encoder-decoder networks
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/08: Neural network learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a training method for an image generation model, an image generation method, a device and equipment, and belongs to the technical field of computers. The method comprises the following steps: acquiring single-object image-text pairs of a plurality of objects, wherein the single-object image-text pair of each object comprises a single-object image and a single-object text, the single-object image of each object contains that object, and the single-object text of each object contains the name and category of that object; generating at least one combined image-text pair based on the single-object image-text pairs of the plurality of objects, wherein each combined image-text pair comprises a combined image and a combined text, each combined image contains at least two of the plurality of objects, and each combined text contains the name, category and position of each object in the corresponding combined image; and training the image generation model based on the single-object image-text pairs of the plurality of objects and the at least one combined image-text pair. This technical scheme helps the image generation model generate high-quality combined images.

Description

Training method of image generation model, image generation method, device and equipment
Technical Field
The present invention relates to the field of computers, and in particular, to a training method for an image generation model, an image generation method, an image generation device, and an apparatus.
Background
With the development of computer technology, methods of automatically generating images with computer devices are widely used. In many scenarios, objects from different images need to be combined into one image. For example, a cat in one image and a dog in another image are synthesized into a combined image so that the combined image shows the cat and the dog at the same time. How to generate a high-quality combined image is an important research focus in this field.
At present, a commonly adopted approach is to have an image generation model learn the information of each object to be synthesized into one image separately. That is, the image generation model is first trained with a first image in which a first object is located and the object description text corresponding to the first image, and then trained with a second image in which a second object is located and the object description text corresponding to the second image. The image generation model is then made to generate a combined image containing the first object and the second object by providing it with an object description text that contains the first object and the second object.
However, in the above technical solution, after the image generation model is trained with the second image in which the second object is located, it often forgets the previously learned information of the first object. As a result, the combined image generated by the trained image generation model often contains two second objects and no first object; that is, the generated combined image is erroneous and of low quality.
Disclosure of Invention
The embodiments of the application provide a training method for an image generation model, an image generation method, a device and equipment, which enable the image generation model to generate a high-quality combined image containing a plurality of objects. The technical scheme is as follows:
in one aspect, a training method of an image generation model is provided, the method comprising:
acquiring single object image-text pairs of a plurality of objects, wherein the single object image-text pairs of each object comprise a single object image and a single object text, the single object image of each object comprises the objects, and the single object text of each object comprises the names and the categories of the objects;
generating at least one combined image-text pair based on single object image-text pairs of the plurality of objects, wherein each combined image-text pair comprises a combined image and a combined text, each combined image comprises at least two objects in the plurality of objects, and each combined text comprises the name, the category and the position of each object in the corresponding combined image;
And training the image generation model based on the single object image-text pairs and at least one combined image-text pair of the plurality of objects.
In another aspect, there is provided an image generation method, the method including:
acquiring a target text, wherein the target text comprises names of a plurality of objects to be synthesized into one image, and the target text is used for indicating the generation requirement of the image;
processing the target text through an image generation model to obtain a combined image containing the plurality of objects, wherein the image generation model is trained based on single object image-text pairs and at least one combined image-text pair of the plurality of objects, the single object image-text pairs of each object comprise single object images and single object texts, the single object images of each object comprise the objects, the single object texts of each object comprise names and categories of the objects, each combined image-text pair comprises a combined image and a combined text, each combined image comprises at least two objects in the plurality of objects, and each combined text comprises the names, the categories and the positions of the respective objects in the corresponding combined image.
In another aspect, there is provided a training apparatus for generating a model of an image, the apparatus comprising:
The first acquisition module is used for acquiring single-object image-text pairs of a plurality of objects, wherein the single-object image-text pairs of each object comprise single-object images and single-object texts, the single-object images of each object comprise the objects, and the single-object texts of each object comprise names and categories of the objects;
the first generation module is used for generating at least one combined image-text pair based on the single object image-text pair of the plurality of objects, wherein each combined image-text pair comprises a combined image and a combined text, each combined image comprises at least two objects in the plurality of objects, and each combined text comprises the name, the category and the position of each object in the corresponding combined image;
and the training module is used for training the image generation model based on the single object image-text pairs and at least one combined image-text pair of the plurality of objects.
In some embodiments, the first generation module comprises:
the acquisition unit is used for randomly acquiring single-object image-text pairs of a preset number of objects from the single-object image-text pairs of the plurality of objects;
the synthesizing unit is used for synthesizing the single object images of the preset number of objects into an image by adopting a mapping mode to obtain a combined image;
And the generation unit is used for generating a combined text corresponding to the combined image based on the single-object text of the preset number of objects and the positions of the objects in the combined image.
In some embodiments, the synthesizing unit is configured to perform at least one of cropping and scaling on a single object image of the preset number of objects based on a preset size, where the preset size is used to represent a size of a combined image to be synthesized; and pasting the processed single object images of the preset number of objects into one image to obtain the combined image.
In some embodiments, the training module is configured to train, for a first round of training, the image generation model based on a single object image-text pair of any one of the plurality of objects; for non-first-round training, training the image generation model based on a single-object image-text pair of an incremental object, a historical image-text pair of at least one historical object and a combined image-text pair comprising the at least one historical object and the incremental object, wherein the incremental object is a newly added object in the current round training, and the at least one historical object is an object which has participated in the training.
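For illustration only (not part of the original application), the round schedule described above can be sketched as follows; the helper names (single_pair, make_history_pair, make_combined_pair, train_on) are hypothetical placeholders, and this is a minimal sketch of the scheduling logic rather than the claimed implementation.

```python
# Sketch of the training schedule: the first round uses only one object's
# single-object pair; each later round mixes the incremental object's pair
# with history pairs and combined pairs. Helper names are assumptions.
def training_rounds(objects, make_history_pair, make_combined_pair, train_on):
    history = []
    for round_idx, obj in enumerate(objects):
        if round_idx == 0:
            train_on([obj.single_pair])                       # first round: one object
        else:
            batch = [obj.single_pair]                         # incremental object
            batch += [make_history_pair(h) for h in history]  # history image-text pairs
            batch += [make_combined_pair(history, obj)]       # combined pair(s)
            train_on(batch)
        history.append(obj)
```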
In some embodiments, the training module comprises:
the first coding unit is used for coding a single object image of the object through the image generation model for first-round training to obtain an image feature vector;
the second coding unit is used for coding the single-object text of the object to obtain a text embedded vector, and the text embedded vector is used for guiding the image generation model to eliminate the interference of noise;
the processing unit is used for processing the image feature vector and the text embedding vector to obtain a generated image;
and the training unit is used for training the image generation model based on the difference between the generated image and the single object image of the object.
In some embodiments, the processing unit is configured to add noise to the image feature vector to obtain a first feature vector; the first feature vector is downsampled based on the text embedding vector through a downsampling module in the image generation model to obtain a second feature vector; extracting features of the first feature vector based on the text embedded vector through a bypass module in the image generation model to obtain a third feature vector, wherein the bypass module is a parameter adjustable module in the image generation model; determining the generated image based on the second feature vector and the third feature vector;
The training unit is used for adjusting parameters of the bypass module in the image generation model with the aim of minimizing the difference between the generated image and the single-object image of the object.
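The application leaves the concrete architecture of the parameter-adjustable bypass module open. As an assumption for illustration only, a LoRA-style low-rank bypass attached to a frozen linear layer is one common way to realize such a module, sketched below.

```python
# Illustrative only: a LoRA-style low-rank bypass around a frozen linear layer.
# The application does not fix the bypass architecture; this is one common choice.
import torch
import torch.nn as nn

class LinearWithBypass(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # the main branch stays frozen
            p.requires_grad_(False)
        self.down = nn.Linear(base.in_features, rank, bias=False)   # trainable bypass
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)        # bypass initially contributes zero
        self.scale = scale

    def forward(self, x):
        # main (frozen) output plus the adjustable bypass output
        return self.base(x) + self.scale * self.up(self.down(x))
```

During training, only the parameters of the down/up projections are updated, which matches the idea of adjusting only the bypass module while keeping the rest of the image generation model fixed.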
In some embodiments, the apparatus further comprises:
the second acquisition module is used for generating a history text of any history object based on the name, the category and the text template of the history object; and inputting a history text of the history object into an image generation model trained based on the history object to obtain a history image of the history object.
In some embodiments, the apparatus further comprises:
the second generation module is used for, in the case that a plurality of history objects exist, generating a history combined text of the plurality of history objects based on the names, categories and a text template of the plurality of history objects, and inputting the history combined text into an image generation model trained based on the plurality of history objects to obtain a history combined image of the plurality of history objects, wherein the text template comprises positions; alternatively, in the case where a plurality of history objects exist, synthesizing the history combined image of the plurality of history objects based on the history images of the respective history objects generated by the image generation model.
In some embodiments, the training module comprises:
the first determining unit is used for determining a first loss of the image generation model based on the single-object image-text pair of the increment object for non-first-round training;
a second determining unit, configured to determine a second loss of the image generation model based on the historical graphic pair of the at least one historical object;
a third determining unit configured to determine a third loss of the image generation model based on a combined graphic pair including the at least one history object and the delta object;
and the training unit is used for training the image generation model based on the first loss, the second loss and the third loss.
In some embodiments, the third determining unit is configured to split the combined text in a combined image-text pair that includes the at least one history object and the incremental object, to obtain at least one first text and a second text, where each first text is used to describe the position of a corresponding history object in the combined image, and the second text is used to describe the position of the incremental object in the combined image; for any text among the at least one first text and the second text, process the text and the combined image through the image generation model to obtain a first generated image, and determine a single-object loss based on the difference between the first generated image and the combined image; process the combined text and the combined image through the image generation model to obtain a second generated image, and determine a combined loss based on the difference between the second generated image and the combined image; and determine the third loss based on the single-object loss of each of the at least one history object, the single-object loss of the incremental object, and the combined loss.
In some embodiments, the third determining unit is configured to obtain, for any one of the at least one history object and the delta object, a mask image of the object, where the mask image is used to indicate a position of the object in the combined image; determining a target region in the combined image based on the mask image of the object; and determining the difference between the combined image and the first generated image at the target area to obtain a single object loss of the object.
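As an illustrative sketch only, restricting the loss to the target region indicated by the mask can be written as a masked mean-squared error; the tensor shapes and the choice of MSE are assumptions, not details fixed by the application.

```python
# Sketch of a masked single-object loss: the difference between the generated
# image and the combined image is computed only where the mask equals 1
# (the target region of that object). Shapes and MSE are assumptions.
import torch
import torch.nn.functional as F

def single_object_loss(generated, combined, mask):
    """generated, combined: (B, C, H, W); mask: (B, 1, H, W) with values 0/1."""
    diff = F.mse_loss(generated * mask, combined * mask, reduction="sum")
    return diff / mask.sum().clamp(min=1.0)   # average over the target region only
```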
In some embodiments, the apparatus further comprises:
a third obtaining module, configured to obtain, for any one of the at least one history object and the incremental object, a paste position of a single object image of the object in the combined image and a blank image, where the blank image has the same size as the combined image; setting a pixel value at a first area corresponding to the pasting position in the blank image as a first numerical value, wherein the first numerical value is used for indicating the position of a single object image of the object; and setting pixel values at a second area and a third area in the blank image as second numerical values to obtain the mask image, wherein the second area is other areas except the first area in the blank image, and the third area is an area in the first area and is positioned at the joint of the first area and the second area.
In another aspect, there is provided an image generating apparatus, the apparatus including:
the acquisition module is used for acquiring a target text, wherein the target text comprises names of a plurality of objects to be synthesized into one image, and the target text is used for indicating the generation requirement of the image;
the generating module is used for processing the target text through an image generating model to obtain a combined image containing a plurality of objects, the image generating model is trained based on single object image-text pairs of the plurality of objects and at least one combined image-text pair, the single object image-text pairs of each object comprise single object images and single object texts, the single object images of each object comprise the objects, the single object texts of each object comprise names and categories of the objects, each combined image-text pair comprises a combined image and a combined text, each combined image comprises at least two objects in the plurality of objects, and each combined text comprises the names, the categories and the positions of the objects in the corresponding combined image.
In another aspect, a computer device is provided, the computer device including a processor and a memory for storing at least one segment of a computer program loaded and executed by the processor to implement a training method or an image generation method of an image generation model in an embodiment of the present application.
In another aspect, a computer readable storage medium having stored therein at least one segment of a computer program loaded and executed by a processor to implement a training method or an image generation method of an image generation model as in embodiments of the present application is provided.
In another aspect, a computer program product is provided, comprising a computer program stored in a computer readable storage medium, the computer program being read from the computer readable storage medium by a processor of a computer device, the computer program being executed by the processor to cause the computer device to perform the training method or the image generation method of the image generation model provided in the above aspects or various alternative implementations of the aspects.
The embodiments of the application provide a training method for an image generation model. Before training, a combined image-text pair containing at least two of the plurality of objects to be synthesized into one image is generated from the single-object image-text pairs of those objects. Training the image generation model with the single-object image-text pair of each object allows the model to accurately learn the features of each object, which provides a guarantee for the model to generate combined images later. Training the image generation model with the at least one combined image-text pair allows the features of each object to be learned again in the context of object combination, which prevents the model from forgetting previously learned object features as training proceeds, and the position information in the combined text helps the image generation model distinguish the objects more accurately. That is, specifying the positions of the objects in the combined image through the combined text allows the image generation model to overcome the problem of objects being missing from the layout of the combined image and to distinguish each object more accurately, so that the trained image generation model can synthesize a plurality of objects onto one image more accurately, meet the purpose of synthesis, and improve the quality of the combined image.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic view of an implementation environment of a training method of an image generation model according to an embodiment of the present application;
FIG. 2 is a flow chart of a training method for an image generation model provided in accordance with an embodiment of the present application;
FIG. 3 is a flow chart of another training method for an image generation model provided in accordance with an embodiment of the present application;
FIG. 4 is a schematic illustration of a combined image provided in accordance with an embodiment of the present application;
FIG. 5 is a schematic diagram of an image generation model provided in accordance with an embodiment of the present application;
FIG. 6 is a schematic diagram of a fine-tuning text word provided in accordance with an embodiment of the present application;
FIG. 7 is a flow chart of a training method for yet another image generation model provided in accordance with an embodiment of the present application;
FIG. 8 is a framework diagram of a training method for an image generation model provided in accordance with an embodiment of the present application;
FIG. 9 is a flow chart of an image generation method provided in accordance with an embodiment of the present application;
FIG. 10 is a schematic illustration of an application of an image generation model provided in accordance with an embodiment of the present application;
FIG. 11 is a block diagram of a training apparatus for image generation models provided in accordance with an embodiment of the present application;
FIG. 12 is a block diagram of a training apparatus for another image generation model provided in accordance with an embodiment of the present application;
fig. 13 is a block diagram of an image generating apparatus provided according to an embodiment of the present application;
fig. 14 is a block diagram of a terminal according to an embodiment of the present application;
fig. 15 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The terms "first," "second," and the like in this application are used to distinguish between identical or similar items that have substantially the same function and function, and it should be understood that there is no logical or chronological dependency between the "first," "second," and "nth" terms, nor is it limited to the number or order of execution.
The term "at least one" in this application means one or more, and the meaning of "a plurality of" means two or more.
It should be noted that, information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals referred to in this application are all authorized by the user or are fully authorized by the parties, and the collection, use, and processing of relevant data is required to comply with relevant laws and regulations and standards of relevant countries and regions. For example, single object image pairs of multiple objects referred to in this application are all acquired with sufficient authorization.
In order to facilitate understanding, terms related to the present application are explained below.
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and extend human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, pre-training model technologies, operation/interaction systems, mechatronics, and the like. The pre-training model, also called a large model or basic model, can be widely applied to downstream tasks in all major directions of artificial intelligence after fine-tuning. Artificial intelligence software technologies mainly include directions such as computer vision, speech processing, natural language processing, and machine learning/deep learning.
A Pre-Training Model (PTM), also called a foundation model or large model, refers to a deep neural network (Deep Neural Network, DNN) with a large number of parameters that is trained on massive unlabeled data. The PTM extracts common features from the data by exploiting the function-approximation capability of the large-parameter DNN, and is adapted to downstream tasks through techniques such as fine-tuning, parameter-efficient fine-tuning (PEFT), and prompt-tuning. Therefore, the pre-training model can achieve good results in few-shot or zero-shot scenarios. PTMs are classified into language models, visual models, speech models, multi-modal models, and so on, according to the data modality processed. For example, language models include ELMo (Embeddings from Language Models), BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer), and the like. A multi-modal model refers to a model that builds representations of two or more data modalities. The pre-training model is an important tool for producing artificial intelligence generated content (Artificial Intelligence Generated Content, AIGC), and can also serve as a general interface connecting multiple specific task models. Through fine-tuning, large language models can be widely applied to downstream tasks. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques, and the like.
Computer Vision (CV) is the science of how to make machines "see"; more specifically, it refers to using cameras and computers instead of human eyes to perform machine vision tasks such as recognizing and measuring targets, and to further perform graphic processing so that the computer produces images more suitable for human eyes to observe or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies and attempts to build artificial intelligence systems that can acquire information from images or multidimensional data. Large-model technology has brought important changes to the development of computer vision: pre-trained models in the vision field such as the Swin Transformer, ViT (Vision Transformer), V-MoE (Vision Mixture of Experts) and MAE (Masked Autoencoder) can be quickly and widely applied to specific downstream tasks through fine-tuning. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR (Optical Character Recognition), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D (three dimensional) techniques, virtual reality, augmented reality and map construction, as well as biometric recognition techniques.
According to the scheme provided by the embodiment of the application, the image generation model can be trained based on the artificial intelligence machine learning technology, and then the trained image generation model is utilized to synthesize a plurality of objects onto one image.
The Stable Diffusion (SD) model is a latent diffusion model that introduces text conditions into the U-Net so as to generate images based on text. The core of the Stable Diffusion model comes from the principle of latent diffusion. A conventional diffusion model is a pixel-based generation model, whereas latent diffusion is a latent-space-based generation model: it compresses the image into a latent space with an auto-encoder, generates in that space with the diffusion model, and finally passes the result through the decoding module of the auto-encoder to obtain the generated image.
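For illustration only (not part of the original application), the latent-space round trip described above can be sketched with the open-source diffusers and transformers libraries; the checkpoint name and the use of these libraries are assumptions about one typical Stable Diffusion setup, not the applicant's implementation.

```python
# Minimal sketch of the latent-space pipeline, assuming a public Stable
# Diffusion checkpoint and the diffusers/transformers APIs.
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

repo = "runwayml/stable-diffusion-v1-5"            # assumed public checkpoint
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")
tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
scheduler = DDPMScheduler.from_pretrained(repo, subfolder="scheduler")

image = torch.randn(1, 3, 512, 512)                # stand-in for a real image in [-1, 1]
with torch.no_grad():
    # 1. the auto-encoder compresses the image into the latent (hidden) space
    latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor
    # 2. the diffusion model (U-Net) works on these latents, guided by text
    tokens = tokenizer(["a photo of a cat"], padding="max_length",
                       max_length=tokenizer.model_max_length, return_tensors="pt")
    text_emb = text_encoder(tokens.input_ids)[0]
    noise = torch.randn_like(latents)
    t = torch.tensor([500])
    noisy = scheduler.add_noise(latents, noise, t)
    pred_noise = unet(noisy, t, encoder_hidden_states=text_emb).sample
    # 3. the decoder of the auto-encoder maps latents back to pixel space
    decoded = vae.decode(latents / vae.config.scaling_factor).sample
```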
The prior class name refers to the name of the category to which an object belongs. For example, for an image of a particular dog, the prior class name is "dog".
Instances and categories: a category is the class to which an object belongs; for example, ImageNet has 1000 common categories, including cats, fish, dogs, and so on. An instance is a specific object under a certain category, such as "Lisi" as an instance of a person (category), or "Wangcai" as an instance of a dog. Each object in the embodiments of the present application is a specific instance.
A multi-instance fine-tuning generation model refers to fine-tuning a pre-trained generation model with images of multiple objects so that the generation model can generate images from text describing those objects. The training method of the image generation model provided by the embodiments of the application can be regarded as a form of multi-instance fine-tuning.
Incremental concept fine-tuning generation model: conventional fine-tuning of a generation model takes the image-text pairs of one object as training samples, so that the generation model learns the features of that object and can support image generation for it. Incremental concept fine-tuning means that, on the basis of a generation model that has already learned one object, the model continues to learn the features of further objects.
The training method of the image generation model provided by the embodiment of the application can be executed by computer equipment. In some embodiments, the computer device is a terminal or a server. In the following, taking a computer device as an example, an implementation environment of a training method of an image generation model provided in an embodiment of the present application will be described, and fig. 1 is a schematic diagram of an implementation environment of a training method of an image generation model provided in an embodiment of the present application. Referring to fig. 1, the implementation environment includes a terminal 101 and a server 102. The terminal 101 and the server 102 can be directly or indirectly connected through wired or wireless communication, and the present application is not limited herein.
In some embodiments, the terminal 101 is, but is not limited to, a smartphone, tablet, notebook computer, desktop computer, smart speaker, smart watch, smart voice-interaction device, smart home appliance, vehicle-mounted terminal, and so on. The terminal 101 installs and runs an application program that supports image processing. The application may be an editing application, an album, a multimedia application, or a communication application, which is not limited in the embodiments of the present application. Illustratively, the terminal 101 is a terminal used by a user. The terminal 101 is able to acquire single-object image-text pairs of multiple objects. The terminal 101 may then send the single-object image-text pairs of the plurality of objects to the server 102, and the server 102 trains the image generation model according to the single-object image-text pairs of the plurality of objects. The single-object image in each single-object image-text pair may be captured by the terminal 101, or may be obtained by the terminal 101 from another computer device, which is not limited in the embodiments of the present application.
Those skilled in the art will recognize that the number of terminals may be greater or smaller. For example, there may be only one terminal, or tens or hundreds of terminals, or more. The number of terminals and the device types are not limited in the embodiments of the present application.
In some embodiments, the server 102 is an independent physical server, or a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), big data and artificial intelligence platforms. The server 102 is used to provide background services for applications that support image processing. In some embodiments, the server 102 takes on the primary computing work and the terminal 101 takes on the secondary computing work; alternatively, the server 102 takes on the secondary computing work and the terminal 101 takes on the primary computing work; alternatively, a distributed computing architecture is used for collaborative computing between the server 102 and the terminal 101.
The image generating method provided in the embodiment of the present application may also be executed in the above implementation environment, and will not be described herein again.
Fig. 2 is a flowchart of a training method of an image generation model according to an embodiment of the present application, and referring to fig. 2, an example of the training method is described in the embodiment of the present application. The training method of the image generation model comprises the following steps:
201. The server acquires single-object image-text pairs of a plurality of objects, wherein each single-object image-text pair comprises a single-object image and a single-object text, each single-object image comprises an object, and each single-object text comprises a name and a category of the object.
In the embodiment of the present application, the object may be an animal, a plant, a person, a commodity, or the like, which is not limited in the embodiments of the present application. In the present application, "a plurality of" refers to two or more. For any object, the single-object image of that object contains only that one object. That is, the single-object image of object A contains only object A and no other object. The single-object image may further include a background, which is not limited in the embodiments of the present application. For any single-object image, the single-object text corresponding to the single-object image is used to describe the object in the single-object image; the name and category of the object in the single-object image are described in the single-object text. The names of the plurality of objects are different from one another. That is, the name may be regarded as an identifier of the object, used to distinguish it from other objects. The categories of the plurality of objects may be the same or different, which is not limited in the embodiments of the present application. The server may obtain the image-text pairs of the plurality of objects stored in advance from its own database, or may obtain them from another computer device such as a terminal, which is not limited in the embodiments of the present application.
For example, the single-object text is "a photo of a wangcai cat on the grass", that is, "an image of a cat on a lawn", and the language to which the single-object text belongs is not limited in the embodiment of the present application. Wherein, "wangcai" is the name of an object in the single object image; "cat" is a category of an object in the single object image.
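For illustration only, a single-object image-text pair as described above can be represented as a simple data structure; the field names, the template string and the helper below are assumptions for this sketch and do not appear in the application.

```python
# Illustrative representation of a single-object image-text pair.
from dataclasses import dataclass

@dataclass
class SingleObjectPair:
    image_path: str   # path to the single-object image (contains exactly one object)
    name: str         # instance name, e.g. "wangcai"
    category: str     # prior class name, e.g. "cat"

    def text(self, template: str = "a photo of a {name} {category} on the grass") -> str:
        # builds the single-object text from a template
        return template.format(name=self.name, category=self.category)

pair = SingleObjectPair("wangcai_01.jpg", "wangcai", "cat")
print(pair.text())  # -> a photo of a wangcai cat on the grass
```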
202. The server generates at least one combined image-text pair based on single-object image-text pairs of the plurality of objects, wherein each combined image-text pair comprises a combined image and a combined text, each combined image comprises at least two objects in the plurality of objects, and each combined text comprises the name, the category and the position of each object in the corresponding combined image.
In the embodiment of the application, for any at least two objects in a plurality of objects, the server acquires a single-object image-text pair of the at least two objects from a single-object image-text pair of the plurality of objects. And the server synthesizes the single object image-text pair of the at least two objects into a combined image-text pair. That is, the server synthesizes a combined image from the single object images of at least two objects. The server generates a combined text according to names, categories in the single object text of at least two objects and positions of the objects in the combined image. The embodiment of the application does not limit the number of the objects in the combined image-text pair.
For example, the single-object text of the cat Wangcai is "a photo of a wangcai cat on the grass", and its single-object image is an image of the cat Wangcai on grass. The single-object text of the bear Xiongsan is "a photo of a xiongsan bear on the grass", and its single-object image is an image of the bear Xiongsan on grass. The server generates a combined image and a combined text based on the single-object image-text pair of the cat Wangcai and the single-object image-text pair of the bear Xiongsan. In the combined image, the cat is on the left and the bear is on the right. The combined text is "a photo of wangcai cat on the left, xiongsan bear on the right."
203. The server trains the image generation model based on single object image-text pairs and at least one combined image-text pair of the plurality of objects.
In the embodiment of the application, the server trains the image generation model by taking a single object image-text pair and at least one combined image-text pair of a plurality of objects as training samples. For any object, the server performs model training through a single object image-text pair of the object so that the image generation model can learn the characteristics of the object. For a plurality of objects, the server performs model training on the combined image-text pairs corresponding to the objects so that the image generation model can learn the differences among the objects. The trained image generation model is used to generate a combined image of any at least two of the plurality of objects.
The embodiment of the application provides a training method of an image generation model, which comprises the steps of generating a combined image-text pair comprising at least two objects in a plurality of objects according to a single object image-text pair of the plurality of objects to be synthesized into one image before training, and training the image generation model through the single object image-text pair of each object, so that the image generation model can accurately learn the characteristics of each object, and provides a guarantee for generating a combined image for a subsequent model; and training the image generation model through at least one combined image-text pair, so that the characteristics of each object can be repeatedly learned in the aspect of object combination, the characteristics of the objects learned before the model forgets along with the training process are avoided, and the image generation model can more accurately distinguish each object by combining the position information in the text. That is, the positions of the objects in the combined image are specified through the combined text, so that the problem that the distribution of a plurality of objects in the combined image is absent can be solved by the image generation model, each object can be distinguished more accurately, the trained image generation model can more accurately synthesize a plurality of objects on one image, the aim of synthesis is met, and the quality of the combined image is improved.
Fig. 3 is a flowchart of another training method of an image generation model according to an embodiment of the present application. Referring to fig. 3, the embodiment of the present application is described by taking execution by a server as an example. The training method of the image generation model comprises the following steps:
301. the server acquires single-object image-text pairs of a plurality of objects, wherein each single-object image-text pair comprises a single-object image and a single-object text, each single-object image comprises an object, and each single-object text comprises a name and a category of the object.
In an embodiment of the present application, for any object, a server obtains at least one single object image of the object. The embodiment of the application does not limit the number of single-object images of each object. The objects in the single object image of the same object are the same, but the pose of the object in the image, the background of the image, and other contents may be the same or different, which is not limited in the embodiment of the present application. For any single object image, the server can identify the single object image to obtain a single object text corresponding to the single object image. Alternatively, the single-object text of each single-object image may also be customized by the user; accordingly, the server can directly acquire the single-object text customized by the user. The embodiment of the application does not limit the acquisition mode of the single-object text. Alternatively, the server may also automatically generate a corresponding single-object image from the single-object text. Wherein each object is an instance. The names of objects may be referred to as "concepts". Accordingly, the single object image may also be referred to as a "single conceptual image". The single object-text pair may also be referred to as a "single concept sample".
302. The server generates at least one combined image-text pair based on single-object image-text pairs of the plurality of objects, wherein each combined image-text pair comprises a combined image and a combined text, each combined image comprises at least two objects in the plurality of objects, and each combined text comprises the name, the category and the position of each object in the corresponding combined image.
In the embodiment of the application, before training the image generation model, the server may generate at least one combined image-text pair in advance according to the single-object image-text pairs of the plurality of objects. That is, for a plurality of objects to be combined into one image, the server may first combine single object images of the plurality of objects into one image, thereby participating in training as a sample. Or in the training process, after training through the single-object image-text pair of a certain object, the server can also generate a combined image-text pair in real time according to the single-object image-text pair of the object which has participated in training and the single-object image-text pair of the newly added object to train. The embodiment of the application does not limit the time for generating the combined image-text pair.
In some embodiments, the server generates the combined pairs of graphics using a map approach. Correspondingly, the process of generating at least one combined image-text pair by the server based on the single-object image-text pairs of the plurality of objects comprises the following steps: the server randomly acquires single-object image-text pairs of a preset number of objects from the single-object image-text pairs of the plurality of objects. Then, the server synthesizes single object images of a preset number of objects into one image by adopting a mapping mode, and a combined image is obtained. Then, the server generates a combined text corresponding to the combined image based on the single object text of the preset number of objects and the positions of the objects in the combined image. The preset number may be any value greater than 1, which is not limited in the embodiment of the present application. According to the scheme provided by the embodiment of the application, the single object images with the preset number of objects are pasted in the same image in a mapping mode, so that the combined image can accurately contain the preset number of objects, and the characteristics of the preset number of objects can be accurately learned when model training is carried out through the combined image; and the combined text is generated by combining the positions of the objects in the image, so that each object can be more accurately distinguished when model training is carried out through the combined text, and the image generation model can generate a combined image which meets the purpose of synthesis.
For example, the plurality of objects includes object a, object b, and object c. The server can generate a double-object image-text pair of the object a and the object b according to the single-object image-text pair of the object a and the single-object image-text pair of the object b. The server can generate a double-object image-text pair of the object a and the object c according to the single-object image-text pair of the object a and the single-object image-text pair of the object c. The server can generate a double-object image-text pair of the object b and the object c according to the single-object image-text pair of the object b and the single-object image-text pair of the object c. The server can generate three object image-text pairs of the object a, the object b and the object c according to the single object image-text pair of the object a, the single object image-text pair of the object b and the single object image-text pair of the object c.
In the process of generating the combined image by adopting the mapping mode, the server can perform at least one of cutting and scaling on single object images of a preset number of objects based on a preset size. Then, the server pastes the processed single object images of the preset number of objects into one image to obtain a combined image. The preset size is used for representing the size of the combined image to be synthesized. According to the scheme provided by the embodiment of the application, at least one of cutting and scaling is performed on the single object image according to the size requirement of the combined image to be synthesized, and then the processed single object image is pasted into one image, so that the accuracy of the combined image is ensured; that is, the combined image is made to meet the requirements of model processing, facilitating the subsequent image generation model processing.
For example, for any one of a preset number of objects to be synthesized into one image, the server randomly extracts one single-object image from the plurality of single-object images of the object. Then, the server scales each edge of the single-object image to 1/4 to 1/2 of its original length, so that the adjusted single-object image is roughly 128 to 256 pixels on a side. Then, the server pastes the adjusted single-object images of the preset number of objects onto a 256×256 image to obtain a combined image.
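For illustration only, the paste-based composition and the accompanying combined text can be sketched as follows with the Pillow library; the canvas size, scale range and text template follow the numbers in the example above, while the placement logic and function names are assumptions of this sketch.

```python
# Sketch of the paste-based composition of a combined image and its combined text.
import random
from PIL import Image

def compose(pairs, canvas_size=256):
    """pairs: list of (single_object_image: PIL.Image, name: str, category: str)."""
    canvas = Image.new("RGB", (canvas_size, canvas_size), "white")
    phrases, boxes = [], []
    positions = ["on the left", "on the right", "at the bottom"]
    for i, (img, name, category) in enumerate(pairs):
        # scale each edge to 1/4 - 1/2 of the original length
        factor = random.uniform(0.25, 0.5)
        w, h = int(img.width * factor), int(img.height * factor)
        img = img.resize((w, h))
        # naive placement: spread the objects across the canvas
        x = max(0, min(int(i * canvas_size / len(pairs)), canvas_size - w))
        y = random.randint(0, max(0, canvas_size - h))
        canvas.paste(img, (x, y))
        boxes.append((x, y, x + w, y + h))
        phrases.append(f"{name} {category} {positions[i % len(positions)]}")
    combined_text = "a photo of " + ", ".join(phrases)
    return canvas, combined_text, boxes
```

The returned boxes record the paste position of each single-object image, which is later used to build the mask images described below.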
In some embodiments, the difficulty of traversing all objects to generate combined images is considered as the number of objects grows. When the number of objects is greater than a preset value (for example, the preset value is 5), the server may, for each object, randomly combine it with one different other object 5 times to generate two-object images, and then randomly combine it with 2 different other objects 3 times to generate three-object images. The number of combinations can be adjusted according to the training-time requirement; the more data, the longer the training, so the embodiment of the application does not limit the number of combinations.
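For illustration only, the random pairing scheme above can be sketched as follows; the counts 5 and 3 follow the text, the sampling code is an assumption, and it presumes the object pool is larger than the preset value so that enough partners exist.

```python
# Sketch of the random combination scheme used instead of full enumeration.
import random

def sample_combinations(objects, two_obj_times=5, three_obj_times=3):
    combos = []
    for obj in objects:
        others = [o for o in objects if o is not obj]
        for _ in range(two_obj_times):                 # 5 two-object combinations
            combos.append([obj, random.choice(others)])
        for _ in range(three_obj_times):               # 3 three-object combinations
            combos.append([obj] + random.sample(others, 2))
    return combos
```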
In some embodiments, in generating the combined image, the server obtains mask images (masks) of the respective objects. The mask image is used to indicate the position of the object during the combined image based training process. That is, the mask image is used to indicate the portion of the combined image based training process that requires a computational penalty. Accordingly, for any one of the objects in the combined image, the server acquires the paste position and the blank image of the single object image of the object in the combined image. The blank image is the same size as the combined image. Then, the server sets a pixel value at a first area corresponding to the paste position in the blank image to a first numerical value. The first value is used to indicate the location of the single object image of the object. The server sets pixel values at the second region and the third region in the blank image to a second numerical value, resulting in a mask image. The second area is other areas except the first area in the blank image. The third region is a region in the first region and is positioned at the joint of the first region and the second region. That is, the third region is located at the inner edge of the first region. The size of the third region is not limited in the embodiment of the present application. The first value and the second value are not limited in this embodiment. According to the scheme provided by the embodiment of the application, the mask images of the objects are acquired, so that the loss of the image generation model learning object can be calculated according to the mask images in the subsequent training process based on the combined images, and the accuracy of the characteristics of the image generation model learning object is improved; moreover, the method does not calculate the loss of the inner edge, can inhibit the edge overfitting of the single object image in the combined image so as to weaken the edge, is beneficial to no obvious edge in the combined image generated by the image generation model, and has natural synthesis effect and no spliced or stuck boundary sense.
For example, the first value is 1 and the second value is 0. That is, for any object, the pixel value at the position of the mask image of the object, where the single object image of the object is not pasted, is 0; the pixel values at the positions of the single object image to which the object is attached are all 1. Correspondingly, the position corresponding to 1 in the combined image needs to calculate the loss; the position corresponding to 0 does not require calculation loss. In addition to the pasted single object image itself, the server may set the pixel value at the position of the mask image corresponding to the inner edge of the single object image to 0. The inner edge is the third region. The width of the inner edge may be 3 pixels, which is not limited in this embodiment. By setting the pixel value of the third region to 0, the loss of the inner edge of the single object image pasted in the combined image is not required to be calculated later, so that the edge of the single object image in the combined image is weakened, and the image generation model is facilitated to generate a more natural combined image.
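For illustration only, the mask construction described above can be sketched with NumPy; the 0/1 values and the 3-pixel inner edge follow the example, while the function and argument names are assumptions of this sketch.

```python
# Sketch of mask construction for one pasted single-object image.
import numpy as np

def make_mask(canvas_size, box, edge=3):
    """box = (x0, y0, x1, y1): paste position of the single-object image."""
    mask = np.zeros((canvas_size, canvas_size), dtype=np.uint8)  # second value: 0
    x0, y0, x1, y1 = box
    mask[y0:y1, x0:x1] = 1                                       # first value: 1
    # zero out the inner edge of the pasted region so its loss is not computed,
    # which weakens visible paste borders in the generated combined image
    mask[y0:y0 + edge, x0:x1] = 0
    mask[y1 - edge:y1, x0:x1] = 0
    mask[y0:y1, x0:x0 + edge] = 0
    mask[y0:y1, x1 - edge:x1] = 0
    return mask
```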
In the process of generating the combined image, the overlapping area between any two pasted single object images does not exceed the target threshold value, so that the influence on the characteristics of the model learning object caused by overlarge overlapping area is avoided. Alternatively, the edges of any two pasted single object images overlap by no more than 20%. "edge overlap" refers to the occlusion of an edge of one single object image by or within another single object image. That is, when there is an overlap region between any two images, there is an overlap between the edges of those two images. The server may also set the pixel values in the mask image corresponding to the overlapping region between the single object images to a second value so that no calculation of the loss of the overlapping region is required subsequently.
In addition to the mask image of each object, the server may also obtain a mask image of the combined image. In the mask image of the combined image, the pixel values of the non-inner-edge, non-overlapping areas of each single object image are all the first value, and the pixel values of all other areas are the second value.
For example, fig. 4 is a schematic diagram of a combined image provided according to an embodiment of the present application. Referring to fig. 4, (a) in fig. 4 is a combined image with a kitten on the left and a bear on the right. The server may acquire three mask images from this combined image: a mask image of the kitten (the pixel values at the non-inner-edge area of the kitten image 401 are all 1, and the other areas, including the position corresponding to the bear image 402, are all 0), a mask image of the bear, and a mask image of the combined image (the pixel values of the non-inner-edge areas of the kitten image 401 and the bear image 402 are all the first value, and the pixel values of the other areas are all the second value). Fig. 4 (b) is a combined image with a cat on the left, a bear on the right, and a deer at the bottom. There is an overlapping area between the deer image 403 and the bear image 402; it can be seen that there is an edge overlap between the deer image 403 and the bear image 402. The server may acquire four mask images from this combined image: a mask image of the cat (the pixel values at the non-inner-edge area of the cat image 401 are all 1, and the other areas, including the positions corresponding to the bear image 402 and the deer image 403, are all 0), a mask image of the bear, a mask image of the deer, and a mask image of the combined image (the pixel values of the non-inner-edge, non-overlapping areas of the single object images of the three objects are all the first value, and the pixel values of the other areas are all the second value).
In the embodiment of the present application, the server may perform model training in a way that learns one object at a time, that is, the image generation model learns the features of one object at a time. Alternatively, the server may perform model training in a way that learns a plurality of objects at a time, that is, the image generation model learns the features of a plurality of objects at a time. The number of objects learned each time is not limited in the embodiment of the present application. The following description mainly takes the case where the image generation model learns the features of one object at a time as an example. The learning principle of the image generation model is the same whether it learns the features of one object at a time or of a plurality of objects at a time, and repeated description is omitted.
303. For the first round of training, the server trains the image generation model based on a single object image-text pair of any one object of a plurality of objects.
In the embodiment of the application, for first-round training, the server randomly acquires a single-object image-text pair of any object from single-object image-text pairs of a plurality of objects. Then, the server inputs the single-object image-text pairs into an image generation model, and trains the image generation model. That is, for the first training, the server encodes a single object image of the object by means of an image generation model, and obtains an image feature vector. And the server encodes the single-object text of the object to obtain a text embedded vector. The text embedded vector is used to guide the image generation model to exclude noise interference. Then, the server processes the image feature vector and the text embedding vector to obtain a generated image. The server then trains the image generation model based on the gap between the generated image and the single object image of the object. According to the scheme provided by the embodiment of the application, the image generation model is guided to eliminate noise interference in the single-object image through the text embedded vector of the single-object text, so that the image generation model can learn the characteristics of the object more accurately, and further guarantee is provided for the follow-up generation of the combined image containing the object.
In the process of processing the image feature vector and the text embedding vector to obtain the generated image, the server may add noise to the image feature vector to obtain a first feature vector. Then, the server downsamples the first feature vector based on the text embedding vector through a downsampling module in the image generation model to obtain a second feature vector. The server also performs feature extraction on the first feature vector based on the text embedding vector through a bypass module in the image generation model to obtain a third feature vector, where the bypass module is a parameter-adjustable module in the image generation model. The server then determines the generated image based on the second feature vector and the third feature vector. The server then adjusts the parameters of the bypass module in the image generation model with the goal of minimizing the gap between the generated image and the single object image of the object.
The server may perform a weighted summation of the second feature vector and the third feature vector to obtain a downsampled feature vector. Then, the server upsamples the downsampled feature vector based on the text embedding vector through an upsampling module in the image generation model to obtain the generated image. The server then adjusts the parameters of the bypass module in the image generation model based on the gap between the generated image and the single object image of the object.
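A minimal sketch of the forward pass just described; the interfaces of the downsampling, bypass, and upsampling modules and the weighting coefficient alpha are assumptions, since the embodiment does not fix them.

```python
def forward_with_bypass(noised_latent, text_emb, down, bypass, up, alpha=0.5):
    """One pass as described above.

    noised_latent: first feature vector (image feature vector with noise added)
    text_emb:      text embedding vector guiding the modules
    down, up:      frozen down-/up-sampling modules of the denoiser
    bypass:        parameter-adjustable bypass module
    alpha:         weighting coefficient of the weighted sum (assumed value)
    """
    feat_down = down(noised_latent, text_emb)       # second feature vector
    feat_bypass = bypass(noised_latent, text_emb)   # third feature vector
    fused = alpha * feat_down + (1 - alpha) * feat_bypass   # weighted summation
    return up(fused, text_emb)                      # used to obtain the generated image
```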
In some embodiments, the server may calculate the gap between the generated image and the single object image of the object by the following equation one.
Equation one:

MSE = (1/n) * Σ_{i=1}^{n} (y_i - ŷ_i)^2

where MSE represents the gap between the generated image and the single object image of the object, which may also be referred to as the loss with which the image generation model learns the object; y_i represents the i-th pixel of the single object image of the object; ŷ_i represents the i-th pixel of the generated image; and n represents the total number of pixels.
For example, fig. 5 is a schematic diagram of an image generation model provided according to an embodiment of the present application. Referring to fig. 5, the image generation model includes an image encoding module, a text encoding module, a bypass module, and a denoising module. One random seed is generated for each image-text pair. The random seed is used to add noise to the image that participates in model training. The server may use a diffusion process to add the noise corresponding to the random seed to the image. The server encodes the noise-added image through the image encoder in the image encoding module so as to convert it into an image feature vector in the hidden space. Alternatively, the server may first convert the image into the image feature vector in the hidden space through the image encoder and then add the noise, which is not limited in the embodiment of the present application. The server encodes the text through the text encoding module using a CLIP (Contrastive Language-Image Pre-training) style encoding to obtain a text embedding vector (Text Embedding). The server then inputs the text embedding vector into the denoising module, controlling the denoising process by assigning the text embedding to the key (K) and value (V) of the QKV attention. The server may also copy the KV mapping of the downsampling process to the bypass module. Activated by the same text embedding vector, the bypass module outputs its own QKV computation result (new features). The weighted sum of these new features and the features at the corresponding positions of the original downsampling process then replaces the features at those positions. The server then upsamples the downsampled output features to predict the noise. The predicted noise is subtracted from the noise-added image feature vector, and the image decoder in the image encoding module restores the input image (the generated image). The server then adjusts the parameters of the bypass module in the image generation model with the goal of minimizing the gap between the generated image and the input image.
The image encoding module may be a VAE (Variational Autoencoder), the bypass module may be a LoRA (Low-Rank Adaptation) model, and the denoising module may be a U-Net model, which is not limited in the embodiment of the present application. The image encoding module, the text encoding module, and the denoising module may adopt the parameters of a trained open-source model (stable-diffusion v1-5). That is, the image generation model provided by the embodiment of the present application may be built on a Stable Diffusion model. The server may also replace the Stable Diffusion model with another open-source text-to-image diffusion model, which is not limited by the embodiment of the present application. The bypass module may be initialized with the corresponding parameters of the denoising module; only the bypass module is updated during training, and the other modules are not updated. That is, the embodiment of the present application trains by fine-tuning the QKV in the bypass module, which can be regarded as a lightweight fine-tuning approach. The server may also directly fine-tune the U-Net structure, or train with other fine-tuning methods such as Textual Inversion or LoRA; the training mode is not limited by the embodiment of the present application. The learning rate is set to 1e-5, which is not limited in the embodiment of the present application. Each round of training goes over the whole sample set, and the number of training iterations is not limited in the embodiment of the present application; for example, each round is trained 10 times, for a total of 100 rounds.
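The lightweight fine-tuning described above (updating only the bypass module while keeping the other modules frozen) might look like the following sketch; the attribute name `model.bypass` and the use of the AdamW optimizer are assumptions, and only the 1e-5 learning rate comes from the text.

```python
import torch

def configure_lightweight_finetuning(model, lr=1e-5):
    """Freeze the image encoding, text encoding and denoising modules and
    update only the bypass module, as described above."""
    for p in model.parameters():
        p.requires_grad = False          # freeze everything
    for p in model.bypass.parameters():
        p.requires_grad = True           # train only the bypass (low-rank) module
    return torch.optim.AdamW(model.bypass.parameters(), lr=lr)
```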
In some embodiments, since the image generation model generates images from text during application, the model's ability to understand text also needs to be improved. That is, in order to improve the effect with which the image generation model learns text, this scheme trains both the text characterization end and the model structure end. The text characterization end refers to the part of the image generation model that extracts the text embedding vector; the model structure end refers to the part that removes noise from the image through the bypass module. At the text characterization end, only the newly learned text words are trained. A text word refers to the name of an object. Other text words need no fine-tuning because there is no specialized data for them. The fine-tuned text words are saved in an object library. Since fine-tuning multiple objects requires training multiple text words and the same text word cannot be used twice, the text word of each training's target object is first taken from a candidate name provided by the user (e.g., xiongsan); when the name chosen by the user is already occupied, a number n is appended to distinguish the word (e.g., xiongsan1, xiongsan2, etc.).
Each text word to be fine-tuned is initialized with its category; for example, the text word xiongsan (Xiong San) is initialized with the text token for "bear". For each object, a combination of 3 embeddings (adjustable to 2-5) can be used as its characterization. During training, only the representations of the text words require parameter updates. Enriching the training samples in this way improves the generalization capability of the image generation model.
For example, fig. 6 is a schematic diagram of fine-tuning a text word provided according to an embodiment of the present application. Referring to fig. 6, the single object text is "a photo of a xiongsan bear". The server encodes the single object text through the text encoder to obtain a text embedding vector of size 77 x 768. The text embedding vector can be divided into three parts: one part represents the text word (the name of the object), one part represents the category of the object, and the rest represents the remaining words in the single object text. The server then updates the vector representation of the text word through model fine-tuning and saves the updated text word in the object library. For example, the object library includes the following three fine-tuned text words: wangcai-cat-[emb3, emb4, emb5]; xiongsan-bear-[emb0, emb1, emb2]; banbi-deer-[emb6, emb7, emb8].
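The object library in the example above can be thought of as a simple mapping from each fine-tuned text word to its category and embedding identifiers; the dictionary layout below is an illustrative assumption (the "deer" category for banbi is inferred from the other examples in this document).

```python
# Illustrative layout of the object library: each fine-tuned text word maps to
# its category and the identifiers of its embedding vectors.
object_library = {
    "wangcai":  {"category": "cat",  "embeddings": ["emb3", "emb4", "emb5"]},
    "xiongsan": {"category": "bear", "embeddings": ["emb0", "emb1", "emb2"]},
    "banbi":    {"category": "deer", "embeddings": ["emb6", "emb7", "emb8"]},
}
```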
304. For non-first-round training, the server trains the image generation model based on single-object image-text pairs of incremental objects, historical image-text pairs of at least one historical object and combined image-text pairs comprising at least one historical object and the incremental objects, wherein the incremental objects are newly added objects in the current round training, and the at least one historical object is an object which has participated in the training.
In the embodiment of the application, for non-first-round training, a server acquires a single object image-text pair of an incremental object, a historical image-text pair of at least one historical object and a combined image-text pair comprising at least one historical object and the incremental object. Then, for any image-text pair, the server processes the image-text pair through an image generation model; and calculates the loss to enable training of the image generation model. Step 304 may be implemented by the following steps 3041 to 3045. Referring to fig. 7, fig. 7 is a flowchart of a training method of still another image generation model according to an embodiment of the present application.
3041. For non-first-round training, the server acquires a single object image-text pair of the incremental object, a history image-text pair of at least one history object, and a combined image-text pair comprising at least one history object and the incremental object.
The server acquires the single object image-text pair of the incremental object from the single object image-text pairs of the plurality of objects. The server may obtain, from the single object image-text pairs of the plurality of objects, a single object image-text pair of at least one historical object as a historical image-text pair. Alternatively, the server may also generate a historical image-text pair of the historical object through the image generation model.
In some embodiments, the process of obtaining the historical image-text pairs of the at least one historical object includes: for any historical object, the server generates a history text for the historical object based on the name, category, and a text template of the historical object. The server then inputs the history text of the historical object into the image generation model trained based on the historical object, and obtains a history image of the historical object. The text template is not limited in the embodiment of the present application. In the scheme provided by the embodiment of the present application, a new single object image is generated by the image generation model from the name, category, and text template of the historical object. This makes it possible to train the image generation model through single object image-text pairs of historical objects, enriches the training samples, avoids reusing the same training samples repeatedly, and can improve the generalization capability of the model.
For example, there are a total of 5 text templates: template 1 "a photo of a * on the grass", template 2 "a photo of a * on bed", template 3 "cute * in sofa", template 4 "cute * in style of cartoon", and template 5 "a painting of a *". The server replaces the "*" in a text template with the name and category of the history object to obtain a history text, such as history text 1 "a photo of a wangcai cat on the grass" and history text 2 "a photo of a xiongsan bear on the grass". The server then obtains a target number of random seeds for each history text. The present embodiment does not limit the target number; for example, target number = 2. Then, for any random seed, the server processes the history text and the random seed through the image generation model to obtain a history image. The server may record the elements that participated in generating the history image, such as id1-seed999-image1 and id1-seed10-image2, where "id1" denotes the text template used, "seed999" denotes the random seed used, and "image1" is an identification of the generated history image. The history image and the history text of the history object then form a history image-text pair. The history image-text pairs may also be referred to as regularization data.
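A sketch of how the history texts and their generation records could be produced from the templates above; the "*" placeholder is written as "{}" so that Python's str.format can fill it, and the `generate_image` callable stands in for a call to the trained image generation model (an assumption here).

```python
import random

# "*" in the templates above is written as "{}" here so that str.format can fill it.
TEMPLATES = {
    "id1": "a photo of a {} on the grass",
    "id2": "a photo of a {} on bed",
    "id3": "cute {} in sofa",
    "id4": "cute {} in style of cartoon",
    "id5": "a painting of a {}",
}

def make_history_pairs(name, category, generate_image, n_seeds=2):
    """Fill the templates with "<name> <category>" and record which template
    and random seed produced each history image."""
    records = []
    for template_id, template in TEMPLATES.items():
        text = template.format(f"{name} {category}")   # e.g. "a photo of a wangcai cat on the grass"
        for _ in range(n_seeds):                       # target number of seeds per text
            seed = random.randint(0, 99999)
            image = generate_image(text, seed)
            records.append({"template": template_id, "seed": seed, "text": text, "image": image})
    return records
```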
In some embodiments, where there are multiple history objects, the server generates a history combined text for the multiple history objects based on the names, categories, and text templates of the multiple history objects. Then, the server inputs the history combined text into the image generation model trained based on the plurality of history objects, and obtains a history combined image of the plurality of history objects. The text template includes a location therein.
For example, the text template is "a photo of *1 on the left, *2 on the right". The server replaces "*1" and "*2" in the text template with the names and categories of the history objects to obtain the history combined text, such as "a photo of wangcai cat on the left, xiongsan bear on the right". The server then obtains a target number of random seeds for each history combined text. The present embodiment does not limit the target number; for example, target number = 1. Then, for any random seed, the server processes the history combined text and the random seed through the image generation model to obtain a history combined image. The server may record the elements that participated in generating the history combined image, such as id1-seed686-image1, where "id1" denotes the text template used, "seed686" denotes the random seed used, and "image1" is an identification of the generated history combined image. The history combined image and the history combined text then form a combined image-text pair of the plurality of history objects.
In some embodiments, in the case where there are a plurality of history objects, the server synthesizes a history combined image of the plurality of history objects based on the history images of the respective plurality of history objects generated by the image generation model. That is, the server may use a mapping method to synthesize the history images of the plurality of history objects into one image, thereby obtaining a history combined image of the plurality of history objects. The principle of this manner is the same as that adopted in step 302, and will not be described here again.
When the image generation model learns the features of a plurality of objects at a time, a plurality of incremental objects and a plurality of historical objects can be combined one by one to obtain a plurality of combined images; the server may also combine the plurality of incremental objects with one another to obtain a plurality of combined images.
3042. For non-first-round training, the server determines a first loss of the image generation model based on the single object image-text pair of the incremental object.
In this step, the server calculates the first loss in the same manner as the principle of calculating the gap between the generated image and the single-object image in step 303. That is, the server may calculate the first loss through the first formula, which is not described herein.
3043. The server determines a second loss of the image generation model based on the historical image-text pair of the at least one historical object.
In this step, the server calculates the second loss in the same manner as the principle of calculating the difference between the generated image and the single-object image in step 303. That is, the server may calculate the second loss through the first formula, which is not described herein.
3044. The server determines a third loss of the image generation model based on the combined image-text pair comprising the at least one historical object and the incremental object.
The server splits the combined text in the combined image-text pair containing the at least one historical object and the incremental object to obtain at least one first text and a second text. Each first text is used to describe the position of a corresponding historical object in the combined image; the second text is used to describe the position of the incremental object in the combined image. For any text among the at least one first text and the second text, the server processes the text and the combined image through the image generation model to obtain a first generated image, and then determines a single object loss based on the difference between the first generated image and the combined image. The server also processes the combined text and the combined image through the image generation model to obtain a second generated image, and determines a combined loss based on the difference between the second generated image and the combined image. The server then determines the third loss based on the single object loss of each of the at least one historical object, the single object loss of the incremental object, and the combined loss.
For example, the server obtains a random image (noise map) from a random seed by Gaussian sampling. The server then adds the random image to the combined image. The combined text carries the position information. The server extracts the text embedding vector of the combined text through the image generation model. When an object in the combined text has a fine-tuned name (e.g., xiongsan), the name is replaced with its trained feature representation at image generation. Forward calculations are performed separately on the multiple texts split from the combined text, that is, the split texts are input into the image generation model one by one. For example, for "a photo of *1 on the left", the embedding vector is extracted (if "*1" is a historical object, the object name needs to be replaced with its trained feature representation) and input into the image generation model. All of these texts are paired with the same input combined image: for each text, one random number is generated, the corresponding noise is added to the combined image, the noised combined image is input, and a model forward calculation is performed, thereby obtaining the third loss of the image generation model.
In some embodiments, the server determines the single object loss based on the difference between the first generated image and the combined image as follows: for any one of the at least one historical object and the incremental object, the server obtains the mask image of the object. The mask image is used to indicate the location of the object in the combined image. The server then determines a target area in the combined image based on the mask image of the object, and determines the difference between the combined image and the first generated image over the target area, resulting in the single object loss of the object. In the scheme provided by the embodiment of the present application, the difference between the generated image and the input combined image is filtered through the mask image, so that only the loss of the object in the generated image is calculated and the background area is ignored. This avoids overfitting to the background, allowing the image generation model to focus on learning the features of the objects and on synthesizing the objects into one image.
Wherein, the generation process of the mask image comprises the following steps: for any one of at least one history object and the increment object, the server acquires a pasting position of a single object image of the object in the combined image and a blank image, wherein the blank image has the same size as the combined image. Then, the server sets a pixel value at a first area corresponding to the paste position in the blank image to a first numerical value. The first value is used to indicate the location of the single object image of the object. Then, the server sets pixel values at the second region and the third region in the blank image to a second numerical value, resulting in a mask image. The second area is other areas except the first area in the blank image, and the third area is an area in the first area and is positioned at the joint of the first area and the second area.
For example, for "a photo of wangcai cat on the left" a mask image of a kitten is used, and the pixel-by-pixel MSE loss of the combined image of the generated image and the input is filtered, i.e., only the pixel loss at the location in the mask image where the pixel value is 1 is retained. Filtering MSE loss of the generated image and the input combined image pixel by using a mask image of a little bear for xiongsan bear on the right, namely only preserving pixel loss of a position with a pixel value of 1 in the mask image; for the combined text "a photo of wangcai cat on the left, xiongsan bear on the right", the mask image of the combined image is used to filter the MSE loss of the generated image and the input combined image pixel by pixel, i.e. only the pixel loss at the position in the mask image where the pixel value is 1 is retained.
3045. The server trains the image generation model based on the first loss, the second loss, and the third loss.
The server performs weighted summation on the first loss, the second loss and the third loss to obtain the total loss of the image generation model. The server then trains the image generation model with the goal of minimizing the total loss.
In some embodiments, the server may calculate the total loss of the image generation model using equation two.
Formula two:

Loss = Loss1 + 0.05 * Loss2 + 0.2 * (0.2 * Loss3_1 + 0.2 * Loss3_2 + 0.6 * Loss3_3)

where Loss represents the total loss of the image generation model; Loss1 represents the first loss; Loss2 represents the second loss; the term (0.2 * Loss3_1 + 0.2 * Loss3_2 + 0.6 * Loss3_3) represents the third loss; Loss3_1 represents the single object loss of the historical object; Loss3_2 represents the single object loss of the incremental object; and Loss3_3 represents the combined loss.
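A direct transcription of formula two, assuming the component losses have already been computed (as tensors or floats); the function and argument names are illustrative.

```python
def total_loss(loss1, loss2, loss3_hist, loss3_inc, loss3_comb):
    """Combine the losses with the weights given in formula two.

    loss3_hist: single object loss of the historical object
    loss3_inc:  single object loss of the incremental object
    loss3_comb: combined loss
    """
    loss3 = 0.2 * loss3_hist + 0.2 * loss3_inc + 0.6 * loss3_comb
    return loss1 + 0.05 * loss2 + 0.2 * loss3
```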
In order to describe the training method of the image generation model provided by the embodiment of the present application more clearly, the training method is further described below with reference to the accompanying drawings. Fig. 8 is a framework diagram of a training method of an image generation model according to an embodiment of the present application. Referring to fig. 8, the server may acquire a single object image and a single object text input by a user, and use them as the single object image-text pair of the incremental object to be learned by the image generation model. The server then trains the image generation model through the single object image-text pair of the incremental object, and may store that pair in the object atlas. The object atlas contains the historical objects already learned by the image generation model. During model training, the server can update the parameters of the image generation model from the single object image-text pair of the incremental object input by the user, the single object image-text pairs of the historical objects in the object atlas, and the combined image-text pairs of the incremental object and the historical objects. The server then inputs a text word entered by the user into the updated image generation model, so that the image generation model generates an image containing the object indicated by the text word. The text word may be referred to as a prompt word, which prompts the objects required to be included in the image generated by the image generation model.
The embodiment of the present application provides a training method for an image generation model. Before training, a combined image-text pair containing at least two of a plurality of objects to be synthesized into one image is generated from the single object image-text pairs of those objects. The image generation model is then trained through the single object image-text pair of each object, so that the model can accurately learn the features of each object, which provides a guarantee for subsequently generating combined images. The image generation model is also trained through at least one combined image-text pair, so that the features of each object are learned again in the context of object combination, which prevents the model from forgetting, as training proceeds, the features of objects it learned earlier; and the position information in the text can be combined so that the image generation model distinguishes each object more accurately and learns the features of each object more accurately. That is, by specifying the positions of the objects in the combined image through the combined text, the image generation model can overcome its lack of knowledge about how a plurality of objects are distributed in a combined image and can distinguish each object more accurately, so that the trained image generation model can more accurately synthesize a plurality of objects into one image, meeting the goal of synthesis and improving the quality of the combined image.
The image generation method provided by the embodiment of the application can be executed by computer equipment. In some embodiments, the computer device is a terminal or a server. Fig. 9 is a flowchart of an image generating method according to an embodiment of the present application. Referring to fig. 9, an example of execution by a server is described in the embodiment of the present application. The image generation method comprises the following steps:
901. the server acquires a target text, wherein the target text comprises names of a plurality of objects to be synthesized into one image, and the target text is used for indicating the generation requirement of the image.
902. The method comprises the steps that a server processes a target text through an image generation model to obtain a combined image containing a plurality of objects, the image generation model is trained based on single object image-text pairs of the plurality of objects and at least one combined image-text pair, each single object image-text pair of the object comprises a single object image and a single object text, each single object image of the object comprises an object, each single object text of the object comprises a name and a category of the object, each combined image-text pair comprises a combined image and a combined text, each combined image comprises at least two objects in the plurality of objects, and each combined text comprises the name, the category and the position of each object in the corresponding combined image.
For example, the target text is "Wangcai and Xiong San sitting on the grass", where the underlined parts are the names of the objects in the target text. For the name of any object in the target text, the server queries the name identifier and category corresponding to the underlined name. The embodiment of the present application does not limit the name identifier; the name identifier may be an English letter string, such as xiongsan. The server looks up the feature representation corresponding to the letter string, obtaining the feature representation of the name + the feature representation of the category, where "+" indicates concatenation. If the feature representation of the name includes 3 embeddings and the feature representation of the category is 1 embedding, there are 4 embeddings after concatenation. The server may then invoke a translation interface and replace the names in the sentence with their name identifiers to obtain a translation, for example "A wangcai cat and a xiongsan bear are on the grassland." This step can also be implemented with a ChatGPT-based translation; the translation mode is not limited in the embodiment of the present application. The server then extracts the features of the translation result and replaces the features of the name identifiers in the translation result with the fine-tuned features. The server then generates a combined image using the image generation model based on the replaced features. The server may then send the combined image to the terminal to present the generated combined image to the user through the terminal.
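A sketch of the name-replacement step described above, reusing the object library structure sketched earlier; the `display_name` field and the `translate` callable (standing in for the translation interface or a ChatGPT-based translation) are assumptions, and the substitution of fine-tuned feature representations after text encoding is not shown.

```python
def build_translated_prompt(target_text, object_library, translate):
    """Replace each object name in the target text with its name identifier
    plus category, then call the translation interface."""
    for identifier, entry in object_library.items():
        display = entry.get("display_name", identifier)   # the name as it appears in the user's text
        target_text = target_text.replace(display, f"{identifier} {entry['category']}")
    return translate(target_text)   # e.g. "A wangcai cat and a xiongsan bear are on the grassland."
```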
In order to describe the image generation method provided in the embodiment of the present application more clearly, the image generation method is further described below with reference to the accompanying drawings. Fig. 10 is a schematic diagram of an application of an image generation model according to an embodiment of the present application. Referring to fig. 10, a user submits a single object image-text pair of a "white kitten" to the image generation system through a target application. The server then trains the image generation model in the image generation system through the single object image-text pair of the "white kitten". The trained image generation model can generate images containing the "white kitten", for example a series of images of the "white kitten" traveling. The user then submits a single object image-text pair of a "little bear" to the image generation system through the target application. The server trains the image generation model through the single object image-text pair of the "little bear". The trained image generation model can then generate images containing both the "white kitten" and the "little bear", for example a series of "kitten + bear" photos. Next, the user submits a single object image-text pair of a "crying face suit deer" to the image generation system through the target application. The server trains the image generation model in the image generation system again through this single object image-text pair. The trained image generation model can then generate images containing "cat + bear + deer".
The application scenario of the image generating method is not limited in this embodiment, and two application scenarios are described below by way of example, but not limited in any way.
Scene one: generation of family-member digital portraits. For example, under user-personalized fine-tuning, the user's own fine-tuned model already exists. "The user's own fine-tuned model" means that the image generation model has already learned the user's own features. Other objects, such as other family members or several pets, can then continue to be fine-tuned; using the method of this scheme, the names (concepts) of different objects are continuously and incrementally fine-tuned into the image generation model, and the model finally supports creation in which a historical object (the user himself) and incremental objects (such as pets and other family members) are generated simultaneously, for example generating a group photo of "me and Wangcai on the sofa".
Scene two: personalized e-commerce collocation. The method provided by the embodiment of the present application offers generation capability to e-commerce users and can support a seller in continuously launching new products and combining new and old products to generate product preview images. For example, a user provides a mouse-shaped robot, and the fine-tuned model of robot mouse x is used to generate a product display image, such as "a robot mouse x stands on a table". The user then launches another product, the tiger robot y, and performs incremental model fine-tuning under this scheme to obtain a new model that supports generation with both the historical concept and the new concept, such as "a robot mouse x and a tiger robot y stand side by side on a table". This allows a better preview of how target commodities match in specific scenarios; for example, when the user already owns a commodity (such as the mouse), it provides references for additional purchases (such as whether the mouse looks better with the cow or with the tiger), assisting the user in decision-making.
In the scheme provided by the embodiment of the present application, the positions of the objects in the combined image are specified through the combined text, so that the image generation model can overcome its lack of knowledge about how a plurality of objects are distributed in a combined image and can distinguish each object more accurately. The trained image generation model can therefore synthesize a plurality of objects into one image more accurately, so that the combined image generated by the trained image generation model is of better quality; that is, the combined image accurately contains the objects to be synthesized, the synthesis effect is natural, and there is no sense of spliced or pasted boundaries.
Fig. 11 is a block diagram of a training apparatus for generating a model from an image according to an embodiment of the present application. Referring to fig. 11, the training apparatus of the image generation model includes: a first acquisition module 1101, a first generation module 1102, and a training module 1103.
A first obtaining module 1101, configured to obtain single-object image-text pairs of a plurality of objects, where each single-object image-text pair of an object includes a single-object image and a single-object text, and each single-object image of an object includes an object, and each single-object text of an object includes a name and a category of the object;
A first generating module 1102, configured to generate at least one combined image-text pair based on a single object image-text pair of a plurality of objects, where each combined image-text pair includes a combined image and a combined text, each combined image includes at least two objects of the plurality of objects, and each combined text includes a name, a category, and a position of each object in the corresponding combined image;
the training module 1103 is configured to train the image generation model based on the single object image-text pairs and the at least one combined image-text pair of the plurality of objects.
In some embodiments, fig. 12 is a block diagram of another training apparatus for image generation models provided in accordance with an embodiment of the present application. Referring to fig. 12, the first generating module 1102 includes:
an obtaining unit 11021, configured to randomly obtain single-object image-text pairs of a preset number of objects from single-object image-text pairs of a plurality of objects;
a synthesizing unit 11022, configured to synthesize a single object image of a preset number of objects into an image by using a mapping manner, so as to obtain a combined image;
a generating unit 11023, configured to generate a combined text corresponding to the combined image based on the single-object text of the preset number of objects and the positions of the objects in the combined image.
In some embodiments, with continued reference to fig. 12, the synthesizing unit 11022 is configured to at least one of crop and scale a single object image of a preset number of objects based on a preset size, where the preset size is used to represent a size of a combined image to be synthesized; and pasting the processed single object images of the preset number of objects into one image to obtain a combined image.
In some embodiments, with continued reference to fig. 12, a training module 1103 is configured to train, for a first round of training, the image generation model based on a single object image-text pair of any one of the plurality of objects; for non-first-round training, train the image generation model based on a single object image-text pair of an incremental object, a historical image-text pair of at least one historical object, and a combined image-text pair comprising the at least one historical object and the incremental object, wherein the incremental object is a newly added object in the current round of training, and the at least one historical object is an object that has already participated in training.
In some embodiments, with continued reference to fig. 12, training module 1103 includes:
a first coding unit 11031, configured to code a single object image of an object through an image generation model for first-round training, so as to obtain an image feature vector;
A second encoding unit 11032, configured to encode a single-object text of an object to obtain a text embedding vector, where the text embedding vector is used to guide the image generation model to eliminate noise interference;
a processing unit 11033, configured to process the image feature vector and the text embedding vector to obtain a generated image;
the training unit 11034 is configured to train the image generation model based on a gap between the generated image and the single object image of the object.
In some embodiments, with continued reference to fig. 12, a processing unit 11033 is configured to add noise to the image feature vector to obtain a first feature vector; the method comprises the steps that through a downsampling module in an image generation model, a first feature vector is downsampled based on a text embedded vector, and a second feature vector is obtained; the method comprises the steps that feature extraction is carried out on a first feature vector based on a text embedded vector through a bypass module in an image generation model to obtain a third feature vector, wherein the bypass module is a parameter adjustable module in the image generation model; determining to generate an image based on the second feature vector and the third feature vector;
the training unit 11034 is configured to adjust parameters of the bypass module in the image generation model with a goal of minimizing a gap between the generated image and the single object image of the object.
In some embodiments, with continued reference to fig. 12, the apparatus further comprises:
a second obtaining module 1104, configured to generate, for any historical object, a historical text of the historical object based on the name, the category, and the text template of the historical object; and inputting the history text of the history object into the image generation model trained based on the history object to obtain the history image of the history object.
In some embodiments, with continued reference to fig. 12, the apparatus further comprises:
a second generating module 1105, configured to generate, when there are a plurality of history objects, a history combined text of a plurality of histories of the object based on names, categories, and text templates of the plurality of history objects; inputting a history combined text into an image generation model trained based on a plurality of history objects to obtain history combined images of the plurality of history objects, wherein a text template comprises positions; alternatively, when a plurality of history objects exist, a history combined image of the plurality of history objects is synthesized based on the history images of the plurality of history objects generated by the image generation model.
In some embodiments, with continued reference to fig. 12, training module 1103 includes:
a first determining unit 11035, configured to determine, for a non-first round of training, a first loss of the image generation model based on a single object image-text pair of the incremental object;
A second determining unit 11036 configured to determine a second loss of the image generation model based on the history image-text pair of the at least one history object;
a third determining unit 11037 for determining a third loss of the image generation model based on the combined graphic pair including at least one history object and the delta object;
the training unit 11034 is configured to train the image generation model based on the first loss, the second loss, and the third loss.
In some embodiments, with continued reference to fig. 12, the third determining unit 11037 is configured to split the combined text in the combined image-text pair including the at least one history object and the incremental object to obtain at least one first text and a second text, where each first text is used to describe the position of a corresponding history object in the combined image, and the second text is used to describe the position of the incremental object in the combined image; for any text among the at least one first text and the second text, process the text and the combined image through the image generation model to obtain a first generated image; determine a single object loss based on a difference between the first generated image and the combined image; process the combined text and the combined image through the image generation model to obtain a second generated image; determine a combined loss based on a difference between the second generated image and the combined image; and determine the third loss based on the single object loss of each of the at least one history object, the single object loss of the incremental object, and the combined loss.
In some embodiments, with continued reference to fig. 12, a third determining unit 11037 is configured to acquire, for any one of the at least one history object and the delta object, a mask image of the object, the mask image being used to indicate a position of the object in the combined image; determining a target region in the combined image based on the mask image of the object; and determining the difference between the combined image and the first generated image at the target area to obtain the single object loss of the object.
In some embodiments, with continued reference to fig. 12, the apparatus further comprises:
a third obtaining module 1106, configured to obtain, for any one of the at least one history object and the incremental object, a paste position of a single object image of the object in the combined image and a blank image, where the blank image has a size identical to that of the combined image; setting a pixel value at a first area corresponding to a pasting position in the blank image as a first numerical value, wherein the first numerical value is used for indicating the position of a single object image of an object; setting pixel values at a second area and a third area in the blank image as second numerical values to obtain a mask image, wherein the second area is other areas except the first area in the blank image, and the third area is an area in the first area and is positioned at the joint of the first area and the second area.
The embodiment of the present application provides a training apparatus for an image generation model. Before training, a combined image-text pair containing at least two of a plurality of objects to be synthesized into one image is generated from the single object image-text pairs of those objects. The image generation model is then trained through the single object image-text pair of each object, so that the model can accurately learn the features of each object, which provides a guarantee for subsequently generating combined images. The image generation model is also trained through at least one combined image-text pair, so that the features of each object are learned again in the context of object combination, which prevents the model from forgetting, as training proceeds, the features of objects it learned earlier; and the position information in the text can be combined so that the image generation model distinguishes each object more accurately and learns the features of each object more accurately. That is, by specifying the positions of the objects in the combined image through the combined text, the image generation model can overcome its lack of knowledge about how a plurality of objects are distributed in a combined image and can distinguish each object more accurately, so that the trained image generation model can more accurately synthesize a plurality of objects into one image, meeting the goal of synthesis and improving the quality of the combined image.
It should be noted that: the training device for the image generation model provided in the above embodiment is only exemplified by the division of the above functional modules when the application program is running, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the training device of the image generation model provided in the above embodiment and the training method embodiment of the image generation model belong to the same concept, and detailed implementation processes of the training device and the training method embodiment of the image generation model are detailed in the method embodiment, and are not repeated here.
Fig. 13 is a block diagram of an image generating apparatus provided according to an embodiment of the present application. Referring to fig. 13, the image generating apparatus includes:
the acquiring module 1301 acquires a target text, wherein the target text comprises names of a plurality of objects to be synthesized into one image, and the target text is used for indicating the generation requirement of the image;
the generating module 1302 is configured to process the target text through an image generating model to obtain a combined image including a plurality of objects, where the image generating model is obtained based on a single object image pair of the plurality of objects and at least one combined image pair, the single object image pair of each object includes a single object image and a single object text, the single object image of each object includes an object, the single object text of each object includes a name and a category of the object, each combined image pair includes a combined image and a combined text, each combined image includes at least two objects of the plurality of objects, and each combined text includes a name, a category, and a location of each object in the corresponding combined image.
According to the image generating apparatus provided by the embodiment of the present application, the positions of the objects in the combined image are specified through the combined text, so that the image generation model can overcome its lack of knowledge about how a plurality of objects are distributed in a combined image and can distinguish each object more accurately. The trained image generation model can therefore synthesize a plurality of objects into one image more accurately, so that the combined image generated by the trained image generation model is of better quality; that is, the combined image accurately contains the objects to be synthesized, the synthesis effect is natural, and there is no sense of spliced or pasted boundaries.
It should be noted that: the image generating apparatus provided in the above embodiment is only exemplified by the above-described division of each functional module when an application is running, and in practical application, the above-described allocation of functions may be performed by different functional modules according to needs, i.e., the internal structure of the apparatus is divided into different functional modules to perform all or part of the functions described above. In addition, the image generating apparatus and the image generating method provided in the foregoing embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiments, which are not described herein again.
In the embodiment of the present application, the computer device may be configured as a terminal or a server, and when the computer device is configured as a terminal, the technical solution provided in the embodiment of the present application may be implemented by the terminal as an execution body, and when the computer device is configured as a server, the technical solution provided in the embodiment of the present application may be implemented by the server as an execution body, and also the technical solution provided in the present application may be implemented by interaction between the terminal and the server, which is not limited in this embodiment of the present application.
Fig. 14 is a block diagram of a terminal 1400 provided according to an embodiment of the present application. Terminal 1400 includes: a processor 1401 and a memory 1402.
Processor 1401 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 1401 may be implemented in at least one hardware form of DSP (Digital Signal Processing ), FPGA (Field-Programmable Gate Array, field programmable gate array), PLA (Programmable Logic Array ). The processor 1401 may also include a main processor, which is a processor for processing data in an awake state, also called a CPU (Central Processing Unit ), and a coprocessor; a coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 1401 may be integrated with a GPU (Graphics Processing Unit, image processor) for taking care of rendering and rendering of content that the display screen is required to display. In some embodiments, the processor 1401 may also include an AI (Artificial Intelligence ) processor for processing computing operations related to machine learning.
Memory 1402 may include one or more computer-readable storage media, which may be non-transitory. Memory 1402 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1402 is used to store at least one computer program for execution by processor 1401 to implement the training method or image generation method of the image generation model provided by the method embodiments in the present application.
In some embodiments, terminal 1400 may optionally further include: a peripheral interface 1403 and at least one peripheral. The processor 1401, memory 1402, and peripheral interface 1403 may be connected by a bus or signal lines. The individual peripheral devices may be connected to the peripheral device interface 1403 via buses, signal lines or a circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1404, a display screen 1405, a camera assembly 1406, audio circuitry 1407, and a power source 1408.
Peripheral interface 1403 may be used to connect at least one Input/Output (I/O) related peripheral to processor 1401 and memory 1402. In some embodiments, processor 1401, memory 1402, and peripheral interface 1403 are integrated on the same chip or circuit board; in some other embodiments, either or both of processor 1401, memory 1402, and peripheral interface 1403 may be implemented on separate chips or circuit boards, which is not limited in this embodiment.
The Radio Frequency circuit 1404 is configured to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 1404 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 1404 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. In some embodiments, the radio frequency circuit 1404 includes: antenna systems, RF transceivers, one or more amplifiers, tuners, oscillators, digital signal processors, codec chipsets, subscriber identity module cards, and so forth. The radio frequency circuit 1404 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: the world wide web, metropolitan area networks, intranets, generation mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity ) networks. In some embodiments, the radio frequency circuit 1404 may also include NFC (Near Field Communication, short range wireless communication) related circuits, which are not limited in this application.
The display screen 1405 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1405 is a touch display screen, the display screen 1405 also has the ability to collect touch signals on or above its surface. A touch signal may be input to the processor 1401 as a control signal for processing. In this case, the display screen 1405 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 1405, disposed on the front panel of the terminal 1400; in other embodiments, there may be at least two display screens 1405, respectively disposed on different surfaces of the terminal 1400 or in a folded design; in still other embodiments, the display screen 1405 may be a flexible display screen disposed on a curved or folded surface of the terminal 1400. The display screen 1405 may even be arranged in a non-rectangular irregular pattern, i.e., an irregularly shaped screen. The display screen 1405 may be made of materials such as an LCD (Liquid Crystal Display) or an OLED (Organic Light-Emitting Diode).
The camera assembly 1406 is used to capture images or video. In some embodiments, the camera assembly 1406 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, Virtual Reality (VR) shooting, or other fused shooting functions. In some embodiments, the camera assembly 1406 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash and can be used for light compensation at different color temperatures.
The audio circuit 1407 may include a microphone and a speaker. The microphone is used to collect sound waves from the user and the environment, convert them into electrical signals, and input them to the processor 1401 for processing or to the radio frequency circuit 1404 for voice communication. For stereo acquisition or noise reduction, a plurality of microphones may be provided at different parts of the terminal 1400. The microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker is used to convert electrical signals from the processor 1401 or the radio frequency circuit 1404 into sound waves. The speaker may be a conventional thin-film speaker or a piezoelectric ceramic speaker. A piezoelectric ceramic speaker can convert electrical signals not only into sound waves audible to humans but also into sound waves inaudible to humans, for purposes such as ranging. In some embodiments, the audio circuit 1407 may also include a headphone jack.
The power supply 1408 is used to supply power to the various components in the terminal 1400. The power supply 1408 may be alternating current, direct current, a disposable battery, or a rechargeable battery. When the power supply 1408 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery, charged through a wired line, or a wireless rechargeable battery, charged through a wireless coil. The rechargeable battery may also support fast-charging technology.
In some embodiments, terminal 1400 also includes one or more sensors 1409. The one or more sensors 1409 include, but are not limited to: acceleration sensor 1410, gyroscope sensor 1411, pressure sensor 1412, optical sensor 1413, and proximity sensor 1414.
The acceleration sensor 1410 may detect the magnitude of acceleration on each of the three coordinate axes of the coordinate system established with the terminal 1400. For example, the acceleration sensor 1410 may be used to detect the components of gravitational acceleration along the three coordinate axes. The processor 1401 may control the display screen 1405 to display the user interface in a landscape or portrait view according to the gravitational acceleration signal acquired by the acceleration sensor 1410. The acceleration sensor 1410 may also be used to acquire motion data for games or for the user.
The gyroscope sensor 1411 may detect the body orientation and rotation angle of the terminal 1400, and may cooperate with the acceleration sensor 1410 to collect the user's 3D motions on the terminal 1400. Based on the data collected by the gyroscope sensor 1411, the processor 1401 can implement functions such as motion sensing (e.g., changing the UI according to a tilting operation by the user), image stabilization during shooting, game control, and inertial navigation.
The pressure sensor 1412 may be disposed on a side frame of the terminal 1400 and/or below the display screen 1405. When the pressure sensor 1412 is disposed on a side frame of the terminal 1400, it can detect the user's grip signal on the terminal 1400, and the processor 1401 performs left- and right-hand recognition or quick operations according to the grip signal collected by the pressure sensor 1412. When the pressure sensor 1412 is disposed below the display screen 1405, the processor 1401 controls operability controls on the UI according to the user's pressure operation on the display screen 1405. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The optical sensor 1413 is used to collect the ambient light intensity. In one embodiment, the processor 1401 may control the display brightness of the display screen 1405 based on the ambient light intensity collected by the optical sensor 1413: when the ambient light intensity is high, the display brightness of the display screen 1405 is increased; when the ambient light intensity is low, the display brightness of the display screen 1405 is decreased. In another embodiment, the processor 1401 may also dynamically adjust the shooting parameters of the camera assembly 1406 based on the ambient light intensity collected by the optical sensor 1413.
The proximity sensor 1414, also referred to as a distance sensor, is typically provided on the front panel of the terminal 1400 and is used to collect the distance between the user and the front of the terminal 1400. In one embodiment, when the proximity sensor 1414 detects that the distance between the user and the front of the terminal 1400 gradually decreases, the processor 1401 controls the display screen 1405 to switch from the screen-on state to the screen-off state; when the proximity sensor 1414 detects that the distance gradually increases, the processor 1401 controls the display screen 1405 to switch from the screen-off state to the screen-on state.
Those skilled in the art will appreciate that the structure shown in fig. 14 is not limiting, and that the terminal 1400 may include more or fewer components than those illustrated, combine certain components, or employ a different arrangement of components.
Fig. 15 is a schematic structural diagram of a server according to an embodiment of the present application. The server 1500 may vary considerably in configuration or performance, and may include one or more processors (Central Processing Units, CPU) 1501 and one or more memories 1502, where the memory 1502 stores at least one computer program that is loaded and executed by the processor 1501 to implement the training method of the image generation model or the image generation method provided by the above method embodiments. Of course, the server may also have a wired or wireless network interface, a keyboard, an input/output interface, and other components for implementing the functions of the device, which are not described here.
The present application also provides a computer-readable storage medium in which at least one computer program is stored. The at least one computer program is loaded and executed by a processor of a computer device to implement the operations performed by the computer device in the training method of the image generation model or in the image generation method of the above embodiments. For example, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
Embodiments of the present application also provide a computer program product comprising a computer program stored in a computer-readable storage medium. A processor of a computer device reads the computer program from the computer-readable storage medium and executes it, so that the computer device performs the training method of the image generation model or the image generation method provided in the above-described alternative implementations.
Those skilled in the art will understand that all or part of the steps for implementing the above embodiments may be implemented by hardware, or by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disk.
The foregoing description covers only preferred embodiments of the present application and is not intended to limit it; any modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the present application shall fall within its scope of protection.

Claims (18)

1. A method of training an image generation model, the method comprising:
acquiring single object image-text pairs of a plurality of objects, wherein the single object image-text pair of each object comprises a single object image and a single object text, the single object image of each object comprises the object, and the single object text of each object comprises the name and the category of the object;
generating at least one combined image-text pair based on the single object image-text pairs of the plurality of objects, wherein each combined image-text pair comprises a combined image and a combined text, each combined image comprises at least two objects among the plurality of objects, and each combined text comprises the name, the category, and the position of each object in the corresponding combined image;
and training the image generation model based on the single object image-text pairs of the plurality of objects and the at least one combined image-text pair.
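For illustration only (not part of the claims), the two kinds of training pairs in claim 1 can be pictured as simple data records; the field names and the prompt wording below are assumptions made for this Python sketch, not terms prescribed by the claim.

```python
# Minimal sketch of the two pair types in claim 1; field names and text wording
# are illustrative assumptions, not prescribed by the claim.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SingleObjectPair:
    image_path: str   # image containing exactly one object
    name: str         # object name, e.g. "mug_A"
    category: str     # object category, e.g. "cup"

    @property
    def text(self) -> str:
        # single object text: the object's name and category
        return f"a photo of {self.name}, a {self.category}"

@dataclass
class CombinedPair:
    image_path: str                        # image containing several objects
    entries: List[Tuple[str, str, str]]    # (name, category, position) per object

    @property
    def text(self) -> str:
        # combined text: name, category and position of every object in the image
        return "; ".join(f"{n}, a {c}, at the {p}" for n, c, p in self.entries)

if __name__ == "__main__":
    print(SingleObjectPair("mug.png", "mug_A", "cup").text)
    print(CombinedPair("combo.png", [("mug_A", "cup", "left"), ("cat_B", "cat", "right")]).text)
```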
2. The method of claim 1, wherein the generating at least one combined image-text pair based on the single object image-text pairs of the plurality of objects comprises:
randomly acquiring single object image-text pairs of a preset number of objects from the single object image-text pairs of the plurality of objects;
synthesizing the single object images of the preset number of objects into one image by pasting, to obtain a combined image;
and generating a combined text corresponding to the combined image based on the single object texts of the preset number of objects and the positions of the objects in the combined image.
3. The method according to claim 2, wherein the synthesizing the single object images of the preset number of objects into one image by pasting, to obtain a combined image, comprises:
performing at least one of cropping and scaling on the single object images of the preset number of objects based on a preset size, wherein the preset size represents the size of the combined image to be synthesized;
and pasting the processed single object images of the preset number of objects into one image to obtain the combined image.
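As a non-limiting aid to claims 2 and 3, the crop-scale-and-paste step could look like the following Pillow sketch; the side-by-side layout, the canvas size, and the center-crop choice are assumptions of the example rather than requirements of the claims.

```python
# Hypothetical crop/scale/paste synthesis; layout and sizes are example choices.
from PIL import Image

def synthesize_combined_image(single_images, canvas_size=(1024, 512)):
    """Paste several single object images side by side into one combined image.

    Returns the combined image and the paste box of each object; the boxes can
    later drive both the position wording of the combined text and the masks.
    """
    n = len(single_images)
    cell_w, cell_h = canvas_size[0] // n, canvas_size[1]
    canvas = Image.new("RGB", canvas_size, (255, 255, 255))
    boxes = []
    for i, img in enumerate(single_images):
        side = min(img.size)                              # center-crop to a square
        left, top = (img.width - side) // 2, (img.height - side) // 2
        patch = img.crop((left, top, left + side, top + side)).resize((cell_w, cell_h))
        box = (i * cell_w, 0, (i + 1) * cell_w, cell_h)   # preset-size cell
        canvas.paste(patch, box[:2])
        boxes.append(box)
    return canvas, boxes

# usage with hypothetical files:
# combined, boxes = synthesize_combined_image([Image.open("mug.png"), Image.open("cat.png")])
```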
4. The method of claim 1, wherein the training the image generation model based on the single object image-text pairs of the plurality of objects and the at least one combined image-text pair comprises:
for first-round training, training the image generation model based on a single object image-text pair of any object in the plurality of objects;
for non-first-round training, training the image generation model based on a single object image-text pair of an incremental object, historical image-text pairs of at least one historical object, and a combined image-text pair comprising the at least one historical object and the incremental object, wherein the incremental object is the object newly added in the current round of training, and the at least one historical object is an object that has already participated in training.
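Claim 4's round structure can be summarized, purely as a sketch, by the loop below; `train_step` and the pair-building helpers are placeholders standing in for whatever the embodiment actually uses.

```python
# Schematic round structure of claim 4 (placeholder callables, not a real API).
def train_rounds(model, objects, build_history_pairs, build_combined_pairs, train_step):
    history = []
    for round_idx, obj in enumerate(objects):
        if round_idx == 0:
            pairs = list(obj.single_pairs)                 # first round: one object only
        else:
            pairs = (
                list(obj.single_pairs)                     # incremental object
                + build_history_pairs(model, history)      # replayed historical pairs
                + build_combined_pairs(history, obj)       # combined pairs with the new object
            )
        for pair in pairs:
            train_step(model, pair)
        history.append(obj)                                # the object becomes historical
    return model
```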
5. The method of claim 4, wherein, for the first round of training, the training the image generation model based on a single object image-text pair of any object of the plurality of objects comprises:
for the first round of training, encoding a single object image of the object through the image generation model to obtain an image feature vector;
encoding a single object text of the object to obtain a text embedded vector, wherein the text embedded vector is used for guiding the image generation model to eliminate noise interference;
processing the image feature vector and the text embedding vector to obtain a generated image;
and training the image generation model based on the gap between the generated image and the single object image of the object.
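A hedged sketch of the first-round step in claim 5 follows. It is written against a generic latent-diffusion style interface, so `image_encoder`, `text_encoder`, and `denoiser` are stand-ins rather than components named in the patent, and the simplified noising omits a real noise schedule.

```python
# Illustrative first-round training step (assumed interfaces; simplified noising).
import torch
import torch.nn.functional as F

def first_round_step(image_encoder, text_encoder, denoiser, image, text_tokens):
    with torch.no_grad():
        z = image_encoder(image)           # image feature vector of the single object image
        cond = text_encoder(text_tokens)   # text embedding that guides noise removal
    noise = torch.randn_like(z)
    t = torch.randint(0, 1000, (z.shape[0],), device=z.device)
    z_noisy = z + noise                    # noise added to the image feature vector
    pred_noise = denoiser(z_noisy, t, cond)
    z_generated = z_noisy - pred_noise     # "generated image" in feature space
    loss = F.mse_loss(z_generated, z)      # gap between generation and the single object image
    loss.backward()
    return loss.detach()
```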
6. The method of claim 5, wherein the processing the image feature vector and the text embedding vector to obtain a generated image comprises:
adding noise to the image feature vector to obtain a first feature vector;
downsampling the first feature vector based on the text embedding vector through a downsampling module in the image generation model to obtain a second feature vector;
extracting features of the first feature vector based on the text embedding vector through a bypass module in the image generation model to obtain a third feature vector, wherein the bypass module is a parameter-adjustable module in the image generation model;
determining the generated image based on the second feature vector and the third feature vector;
the training the image generation model based on the gap between the generated image and the single object image of the object comprises:
and adjusting parameters of the bypass module in the image generation model with the aim of minimizing the gap between the generated image and the single object image of the object.
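Claim 6's bypass module reads like a trainable side branch next to a frozen down-sampling path; one plausible (but assumed) realization is a low-rank adapter, sketched below.

```python
# Low-rank side branch as one assumed realization of the "bypass module".
import torch.nn as nn

class BypassBlock(nn.Module):
    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)   # project features down
        self.up = nn.Linear(rank, dim, bias=False)     # project back to the feature dim
        nn.init.zeros_(self.up.weight)                 # branch starts with zero contribution

    def forward(self, x):
        return self.up(self.down(x))                   # third feature vector

def freeze_backbone_attach_bypass(backbone: nn.Module, dim: int) -> BypassBlock:
    for p in backbone.parameters():
        p.requires_grad_(False)                        # only the bypass stays adjustable
    return BypassBlock(dim)

# the generated image is then determined from the backbone's second feature
# vector plus the bypass's third feature vector, e.g. y = y_backbone + bypass(x_noisy)
```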
7. The method of claim 4, wherein the process of obtaining the historical image-text pairs of the at least one historical object comprises:
for any historical object, generating a historical text of the historical object based on the name, the category and a text template of the historical object;
and inputting the historical text of the historical object into the image generation model trained based on the historical object to obtain a historical image of the historical object.
8. The method of claim 7, wherein the method further comprises:
when a plurality of historical objects exist, generating a historical combined text of the plurality of historical objects based on the names, the categories, and a text template of the plurality of historical objects, and inputting the historical combined text into the image generation model trained based on the plurality of historical objects to obtain a historical combined image of the plurality of historical objects, wherein the text template comprises positions; or,
when a plurality of historical objects exist, synthesizing a historical combined image of the plurality of historical objects based on the historical images of the plurality of historical objects generated by the image generation model.
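A small sketch of the replay texts in claims 7 and 8; the template strings are assumptions, since the claims only require that name, category, and (for the combined case) position be filled into a text template.

```python
# Assumed text templates for historical and historical combined texts (claims 7 and 8).
def history_text(name: str, category: str,
                 template: str = "a photo of {name}, a {category}") -> str:
    return template.format(name=name, category=category)

def history_combined_text(entries,
                          template: str = "{name}, a {category}, at the {position}") -> str:
    # entries: list of (name, category, position) tuples for the historical objects
    return "; ".join(template.format(name=n, category=c, position=p) for n, c, p in entries)

# the matching historical images come from the model trained so far, e.g. a
# hypothetical call such as model.generate(history_text("mug_A", "cup")), or are
# synthesized by pasting previously generated single object images together.
```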
9. The method of claim 4, wherein, for non-first-round training, the training the image generation model based on a single object image-text pair of an incremental object, historical image-text pairs of at least one historical object, and a combined image-text pair comprising the at least one historical object and the incremental object comprises:
for non-first-round training, determining a first loss of the image generation model based on a single-object image-text pair of the incremental object;
determining a second loss of the image generation model based on the historical image-text pairs of the at least one historical object;
determining a third loss of the image generation model based on a combined image-text pair comprising the at least one historical object and the incremental object;
and training the image generation model based on the first loss, the second loss, and the third loss.
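The three losses of claim 9 are simply combined to drive the update; the weights in the sketch are assumptions, since the claim itself does not fix how the losses are balanced.

```python
# Combining the three losses of claim 9 (weights are illustrative assumptions).
def non_first_round_loss(loss_incremental, loss_history, loss_combined,
                         w1: float = 1.0, w2: float = 1.0, w3: float = 1.0):
    # first loss:  single object pairs of the incremental object
    # second loss: replayed historical pairs (mitigates forgetting)
    # third loss:  combined pairs containing historical + incremental objects
    return w1 * loss_incremental + w2 * loss_history + w3 * loss_combined
```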
10. The method of claim 9, wherein the determining a third loss of the image generation model based on the combined image-text pair comprising the at least one historical object and the incremental object comprises:
splitting the combined text in the combined image-text pair containing the at least one historical object and the incremental object to obtain at least one first text and a second text, wherein each first text is used to describe the position of the corresponding historical object in the combined image, and the second text is used to describe the position of the incremental object in the combined image;
for any text among the at least one first text and the second text, processing the text and the combined image through the image generation model to obtain a first generated image, and determining a single object loss based on the difference between the first generated image and the combined image;
processing the combined text and the combined image through the image generation model to obtain a second generated image; determining a combining loss based on a difference between the second generated image and the combined image;
and determining the third loss based on the single object loss of each of the at least one historical object, the single object loss of the incremental object, and the combining loss.
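If the combined text is built as semicolon-separated per-object descriptions (as in the earlier sketches, which is an assumption of this example), the split in claim 10 reduces to the following.

```python
# Splitting a combined text into first texts and the second text (claim 10),
# assuming semicolon-separated per-object descriptions with the incremental
# object described last.
def split_combined_text(combined_text: str):
    segments = [seg.strip() for seg in combined_text.split(";") if seg.strip()]
    first_texts, second_text = segments[:-1], segments[-1]
    return first_texts, second_text

# each first text and the second text is scored against the combined image for a
# single object loss; the full combined text yields the combining loss, and the
# third loss aggregates all of them.
```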
11. The method of claim 10, wherein the determining a single object loss based on a difference between the first generated image and the combined image comprises:
for any one of the at least one historical object and the incremental object, obtaining a mask image of the object, the mask image being used to indicate the position of the object in the combined image;
determining a target region in the combined image based on the mask image of the object;
and determining the difference between the combined image and the first generated image in the target region to obtain a single object loss of the object.
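The single object loss of claim 11 restricts the image difference to the masked region; a minimal tensor version, assuming images in [0, 1] and a binary mask, is:

```python
# Masked single object loss (claim 11); all tensors share a shape, mask is 0/1.
import torch

def single_object_loss(generated: torch.Tensor, combined: torch.Tensor,
                       mask: torch.Tensor) -> torch.Tensor:
    squared_diff = (generated - combined) ** 2
    masked = squared_diff * mask                      # only the target region contributes
    return masked.sum() / mask.sum().clamp(min=1.0)   # average over the masked area
```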
12. The method of claim 11, wherein the generating of the mask image comprises:
for any one of the at least one historical object and the incremental object, acquiring the pasting position of the single object image of the object in the combined image and a blank image, wherein the blank image has the same size as the combined image;
setting a pixel value at a first area corresponding to the pasting position in the blank image as a first numerical value, wherein the first numerical value is used for indicating the position of a single object image of the object;
and setting pixel values at a second area and a third area in the blank image as second numerical values to obtain the mask image, wherein the second area is other areas except the first area in the blank image, and the third area is an area in the first area and is positioned at the joint of the first area and the second area.
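Claim 12's mask construction can be sketched with NumPy: the paste region is set to the first value, everything else keeps the second value, and a thin band just inside the region border (the junction of the two regions) is also set to the second value; the band width is an assumed parameter.

```python
# Assumed mask construction for claim 12 (seam width chosen for illustration).
import numpy as np

def make_mask(canvas_hw, paste_box, seam: int = 4) -> np.ndarray:
    h, w = canvas_hw
    x0, y0, x1, y1 = paste_box
    mask = np.zeros((h, w), dtype=np.float32)        # second value everywhere (blank image)
    mask[y0:y1, x0:x1] = 1.0                         # first value over the pasting position
    # third region: strip inside the first region along its border with the second region
    mask[y0:y0 + seam, x0:x1] = 0.0
    mask[max(y1 - seam, y0):y1, x0:x1] = 0.0
    mask[y0:y1, x0:x0 + seam] = 0.0
    mask[y0:y1, max(x1 - seam, x0):x1] = 0.0
    return mask
```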
13. An image generation method, the method comprising:
acquiring a target text, wherein the target text comprises names of a plurality of objects to be synthesized into one image, and the target text is used for indicating the generation requirement of the image;
processing the target text through an image generation model to obtain a combined image containing the plurality of objects, wherein the image generation model is trained based on single object image-text pairs of the plurality of objects and at least one combined image-text pair, the single object image-text pair of each object comprises a single object image and a single object text, the single object image of each object comprises the object, the single object text of each object comprises the name and the category of the object, each combined image-text pair comprises a combined image and a combined text, each combined image comprises at least two objects among the plurality of objects, and each combined text comprises the name, the category, and the position of each object in the corresponding combined image.
14. A training apparatus for an image generation model, the apparatus comprising:
the first acquisition module is used for acquiring single object image-text pairs of a plurality of objects, wherein the single object image-text pair of each object comprises a single object image and a single object text, the single object image of each object comprises the object, and the single object text of each object comprises the name and the category of the object;
the first generation module is used for generating at least one combined image-text pair based on the single object image-text pairs of the plurality of objects, wherein each combined image-text pair comprises a combined image and a combined text, each combined image comprises at least two objects among the plurality of objects, and each combined text comprises the name, the category, and the position of each object in the corresponding combined image;
and the training module is used for training the image generation model based on the single object image-text pairs of the plurality of objects and the at least one combined image-text pair.
15. An image generation apparatus, the apparatus comprising:
the acquisition module is used for acquiring a target text, wherein the target text comprises names of a plurality of objects to be synthesized into one image, and the target text is used for indicating the generation requirement of the image;
and the generation module is used for processing the target text through an image generation model to obtain a combined image containing the plurality of objects, wherein the image generation model is trained based on single object image-text pairs of the plurality of objects and at least one combined image-text pair, the single object image-text pair of each object comprises a single object image and a single object text, the single object image of each object comprises the object, the single object text of each object comprises the name and the category of the object, each combined image-text pair comprises a combined image and a combined text, each combined image comprises at least two objects among the plurality of objects, and each combined text comprises the name, the category, and the position of each object in the corresponding combined image.
16. A computer device, characterized in that the computer device comprises a processor and a memory for storing at least one computer program, wherein the at least one computer program is loaded by the processor to perform the training method of the image generation model according to any one of claims 1 to 12, or is loaded by the processor to perform the image generation method according to claim 13.
17. A computer-readable storage medium for storing at least one computer program, wherein the at least one computer program is used to perform the training method of the image generation model according to any one of claims 1 to 12, or to perform the image generation method according to claim 13.
18. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements a training method of an image generation model according to any of claims 1 to 12; alternatively, the computer program, when executed by a processor, implements the image generation method of claim 13.
CN202311230141.8A 2023-09-20 2023-09-20 Training method of image generation model, image generation method, device and equipment Pending CN117351115A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311230141.8A CN117351115A (en) 2023-09-20 2023-09-20 Training method of image generation model, image generation method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311230141.8A CN117351115A (en) 2023-09-20 2023-09-20 Training method of image generation model, image generation method, device and equipment

Publications (1)

Publication Number Publication Date
CN117351115A true CN117351115A (en) 2024-01-05

Family

ID=89362261

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311230141.8A Pending CN117351115A (en) 2023-09-20 2023-09-20 Training method of image generation model, image generation method, device and equipment

Country Status (1)

Country Link
CN (1) CN117351115A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118155023A (en) * 2024-05-11 2024-06-07 腾讯科技(深圳)有限公司 Text graph and model training method and device, electronic equipment and storage medium
CN118154912A (en) * 2024-05-11 2024-06-07 北京百度网讯科技有限公司 Image generation model processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination