CN117611947A - Training method and device for image generation model, electronic equipment and storage medium - Google Patents

Training method and device for image generation model, electronic equipment and storage medium

Info

Publication number
CN117611947A
Authority
CN
China
Prior art keywords
sample
image
target
training
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311668162.8A
Other languages
Chinese (zh)
Inventor
宁苒宇
吴远泸
刘乙邑
甄志坚
梁可弘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN202311668162.8A priority Critical patent/CN117611947A/en
Publication of CN117611947A publication Critical patent/CN117611947A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/002D [Two Dimensional] image generation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/60Analysis of geometric attributes
    • G06T7/62Analysis of geometric attributes of area, perimeter, diameter or volume

Abstract

The application provides a training method and apparatus for an image generation model, an electronic device, and a storage medium. The method includes: acquiring a training sample data set, the training sample data set comprising a plurality of target class sample images; performing target part recognition on the target class sample images to obtain an image frame corresponding to the target part; determining a sample expansion number based on a first ratio of the area of the image frame to the area of the target class sample image; generating an expanded sample image based on the image frame and copying the expanded sample image according to the sample expansion number; and adding the copied expanded sample images to the training sample data set and training a target image generation model based on the expanded training sample data set. By reasonably expanding local image sample data, the method improves the image generation model's ability to handle local details and effectively alleviates the poor generation quality caused by key parts occupying only a small fraction of the image.

Description

Training method and device for image generation model, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of image generation technologies, and in particular, to a training method and apparatus for an image generation model, an electronic device, and a storage medium.
Background
This section is intended to provide a background or context for embodiments of the present application that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
The text-to-image generation task aims to generate a realistic image corresponding to an input text description. To enable text-to-image generation, a deep learning model is typically trained on large-scale image and text datasets; by learning the semantic relationships between images and text, the model can translate a given text description into an appropriate image. In the related art, for certain key regions that occupy a small area in an image, such as face and hand regions, existing image generation models have limited capability in processing such complex local details, so the local detail generation effect is poor.
Disclosure of Invention
In view of the foregoing, an object of the present application is to provide a training method and apparatus for an image generation model, an electronic device and a storage medium, which are used for solving or partially solving the above-mentioned problems in the background art.
Based on the above objects, the present application provides a training method for an image generation model, including:
acquiring a training sample data set; the training sample dataset comprises a plurality of target class sample images;
performing target part recognition on the target class sample image to obtain an image frame corresponding to the target part;
determining a sample expansion number based on a first ratio of the area of the image frame to the area of the target class sample image;
generating an expanded sample image based on the image frame and copying the expanded sample image based on the expanded number of samples;
and adding the copied expanded sample image to the training sample data set, and training a target image generation model based on the expanded training sample data set.
Based on the same inventive concept, the exemplary embodiments of the present application also provide another training method of an image generation model, including:
acquiring a training sample data set; the training sample dataset comprises a plurality of target class sample images;
dividing the training sample data set into a plurality of sub-sample data sets based on the size of the target class sample image; wherein the sample images in each sub-sample data set are the same size and the sample images in different sub-sample data sets are different sizes;
and training the target image generation model in batches based on each sub-sample data set.
Based on the same inventive concept, the exemplary embodiments of the present application further provide a training apparatus for an image generation model, including:
the first acquisition module acquires a training sample data set; the training sample dataset comprises a plurality of target class sample images;
the identification module is used for performing target part recognition on the target class sample image to obtain an image frame corresponding to the target part;
the determining module is used for determining the sample expansion quantity based on a first ratio of the area of the image frame to the area of the target sample image;
a generation module that generates an expanded sample image based on the image frame and copies the expanded sample image based on the sample expansion number;
and the first training module adds the copied expanded sample image to the training sample data set and trains a target image generation model based on the expanded training sample data set.
Based on the same inventive concept, the exemplary embodiments of the present application also provide another training apparatus for an image generation model, including:
the second acquisition module acquires a training sample data set; the training sample dataset comprises a plurality of target class sample images;
a grouping module that divides the training sample data set into a plurality of sub-sample data sets based on the size of the target class sample image; wherein the sample images in each sub-sample data set are the same size and the sample images in different sub-sample data sets are different sizes;
and the second training module trains the target image generation model in batches based on each sub-sample data set.
Based on the same inventive concept, the exemplary embodiments of the present application also provide an electronic device including a memory, a processor, and a computer program stored on the memory and executable by the processor, the processor implementing the training method of the image generation model as described above when executing the program.
Based on the same inventive concept, the present exemplary embodiments also provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the training method of the image generation model as described above.
Based on the same inventive concept, the exemplary embodiments of the present application also provide a computer program product comprising a computer program that is executed by one or more processors to cause the processors to perform the training method of the image generation model as described above.
From the above, it can be seen that the training method and apparatus, electronic device and storage medium for an image generation model provided by the present application acquire a training sample data set, the training sample data set comprising a plurality of target class sample images; perform target part recognition on the target class sample images to obtain an image frame corresponding to the target part; determine a sample expansion number based on a first ratio of the area of the image frame to the area of the target class sample image; generate an expanded sample image based on the image frame and copy the expanded sample image according to the sample expansion number; and add the copied expanded sample images to the training sample data set and train a target image generation model based on the expanded training sample data set. Through reasonable expansion of local image sample data, the capability of the image generation model to handle local details is improved, which effectively alleviates the poor generation effect caused by key parts occupying only a small fraction of the image.
Drawings
In order to more clearly illustrate the technical solutions of the present application or related art, the drawings that are required to be used in the description of the embodiments or related art will be briefly described below, and it is apparent that the drawings in the following description are only embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort to those of ordinary skill in the art.
Fig. 1 is a schematic diagram of an application scenario in an embodiment of the present application;
FIG. 2 is a flowchart of a training method of an image generation model according to an embodiment of the present application;
FIG. 3 is a schematic diagram showing the effect of an image generated by an image generation model in the related art;
FIG. 4 is a schematic diagram of the effect of an image generated by an image generation model according to an embodiment of the present application;
FIG. 5 is a schematic diagram showing the effect of generating an image without increasing local loss according to an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating an effect of generating an image after increasing local loss according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an effect of generating an image when sample data sets are not grouped according to an embodiment of the present application;
FIG. 8 is a schematic diagram illustrating an effect of grouping sample data sets to generate an image according to an embodiment of the present application;
FIG. 9 is a flowchart of another training method of an image generation model according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a training apparatus for an image generation model according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of another training apparatus for an image generation model according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a specific electronic device according to an embodiment of the present application.
Detailed Description
The principles and spirit of the present application will be described below with reference to several exemplary embodiments. It should be understood that these embodiments are presented merely to enable one skilled in the art to better understand and practice the present application and are not intended to limit the scope of the present application in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It should be noted that unless otherwise defined, technical or scientific terms used in the embodiments of the present disclosure should be given the ordinary meaning as understood by one of ordinary skill in the art to which the present disclosure pertains. The terms "first," "second," and the like, as used in embodiments of the present disclosure, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those elements but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed. Furthermore, in the description of the present application, unless otherwise indicated, the term "plurality" refers to two or more. The term "and/or" describes an association relationship of associated objects, meaning that there may be three relationships, e.g., a and/or B, which may represent: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship.
According to embodiments of the application, a training method and apparatus for an image generation model, an electronic device and a storage medium are provided.
In this document, it should be understood that any number of elements in the drawings is for illustration and not limitation, and that any naming is used only for distinction and not for any limitation.
The principles and spirit of the present application are explained in detail below with reference to several representative embodiments thereof.
Summary of the Invention
In the related art, in order to achieve text-to-image generation, training of a deep learning model is typically performed using a large-scale image and text data set, and the model can convert a given text description into an appropriate image by learning the semantic relationship between the image and the text. However, in the related art, for certain specific key areas, such as face and hand areas, which occupy a relatively small area in an image, the existing image generation model has limited capability in processing such complex local details, which results in poor local detail generation effect.
In order to solve the above problems, the present application provides a training method of an image generation model, which specifically includes:
acquiring a training sample data set, the training sample data set comprising a plurality of target class sample images; performing target part recognition on the target class sample images to obtain an image frame corresponding to the target part; determining a sample expansion number based on a first ratio of the area of the image frame to the area of the target class sample image; generating an expanded sample image based on the image frame and copying the expanded sample image according to the sample expansion number; and adding the copied expanded sample images to the training sample data set and training a target image generation model based on the expanded training sample data set. Through reasonable expansion of local image sample data, the capability of the image generation model to process local details is improved, the poor generation effect caused by key parts occupying a small fraction of the image is effectively alleviated, and local features of the image are enhanced, so that local detail generation effects, such as the face and hand details of a character image, can be optimized.
Having described the basic principles of the present application, various non-limiting embodiments of the present application are specifically described below.
Application scene overview
In some specific application scenarios, the training method of the image generation model of the present application may be applied to various systems involving training of an image generation model. Alternatively, the system may be a game or animation system. As an example, referring to fig. 1, the application scenario includes at least one server 102 and at least one terminal 101. Terminal devices include, but are not limited to, desktop computers, mobile phones, mobile computers, tablet computers, media players, smart wearable devices, televisions, personal digital assistants (PDAs), or other electronic devices capable of performing the functions described above. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (content delivery networks), big data and artificial intelligence platforms. The server and the terminal can communicate through a network to realize data transmission. The network may be a wired network or a wireless network, which is not specifically limited in this application.
The server may be a server providing various services. In particular, the server may be configured to provide background services for applications running on the terminal. Alternatively, in some implementations, the training method of the image generation model provided in the embodiments of the present application may be performed by a terminal device or a server. The server may be hardware or software. When the server is hardware, the server may be implemented as a distributed server cluster formed by a plurality of servers, or may be implemented as a single server. When the server is software, it may be implemented as a plurality of software or software modules (e.g., software or software modules for providing distributed services), or as a single software or software module. The embodiment of the present application is not particularly limited thereto.
Alternatively, the wireless network or wired network uses standard communication techniques and/or protocols. The network is typically the internet, but can be any network including, but not limited to, a local area network (local area network, LAN), metropolitan area network (metropolitan area network, MAN), wide area network (wide area network, WAN), mobile, wired or wireless network, private network, or any combination of virtual private networks. In some embodiments, data exchanged over the network is represented using techniques and/or formats including hypertext markup language (HTML), extensible markup language (extensible markup language, XML), and the like. In addition, all or some of the links can be encrypted using conventional encryption techniques such as secure socket layer (secure socket layer, SSL), transport layer security (transport layer security, TLS), virtual private network (virtual private network, VPN), internet protocol security (internet protocol security, IPsec), and the like. In other embodiments, custom and/or dedicated data communication techniques can also be used in place of or in addition to the data communication techniques described above.
The training method of the image generation model according to the exemplary embodiment of the present application is described below in connection with a specific application scenario. It should be noted that the above application scenario is only shown for the convenience of understanding the spirit and principles of the present application, and embodiments of the present application are not limited in any way in this respect. Rather, embodiments of the present application may be applied to any scenario where applicable.
It should be understood that, although the steps in the flowchart of fig. 2 are shown in the sequence indicated by the arrows, the steps are not necessarily performed in that sequence. Unless explicitly stated herein, the order of execution of these steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 2 may include multiple sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times; these sub-steps or stages need not be performed in sequence, but may be performed in turns or alternately with at least a portion of the sub-steps or stages of other steps.
Exemplary method
Referring to fig. 2, an embodiment of the present application provides a training method of an image generation model, where an execution subject of the training method of the image generation model may be, but is not limited to, a server or a terminal device. The method comprises the following steps:
s101, acquiring a training sample data set; the training sample dataset includes a plurality of target class sample images.
In particular, a training sample data set is acquired before training the image generation model. Optionally, the training sample data set may be obtained through user input or through a sample data set published on the network. It should be noted that the training sample data mainly includes a plurality of target class sample images; the target class mainly refers to the type of image that the current user wants to generate, and the specific target class may be set as needed without limitation. A target class sample image is a sample image belonging to the target class. For example, if a user wants to train an image generation model capable of generating Shar Pei dogs, the target class is the Shar Pei dog, and the target class sample images are images of various Shar Pei dogs. Alternatively, in some embodiments, the target class may be a specific object; for example, if a user wants to generate images of themselves in various scenes, the target class is the user, and the corresponding target class images are various images of the user.
S102, performing target part recognition on the target class sample image to obtain an image frame corresponding to the target part.
In an embodiment, after the training sample data set is acquired, a conventional training method would directly train the image generation model on the acquired set. In this embodiment, however, to further improve the model's capture and learning of details, the data in the sample data set is first enhanced: target part recognition is performed on the target class sample image to obtain an image frame corresponding to the target part. Optionally, any local recognition method in the prior art may be used; for example, the face or the hand in the target class sample image may be recognized by a local recognition model in the related art, such as YOLOv3. The target part in this embodiment may be set as needed; for example, it may be a designated part such as the face or hand in the sample image, and the corresponding target part may be identified by a face or hand recognition model. The image frame corresponding to the target part is a bounding box, determined from the sample image, that encloses the target part; optionally, the image frame may be a rectangular box.
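As a minimal illustration of this step, the sketch below hides a generic detector behind a single function; the `detect_target_part` name and its return convention are assumptions made for illustration, since the description only requires some local recognition model such as YOLOv3:

```python
# Sketch of step S102: obtain the image frame (bounding box) of the
# target part. The detector call is a hypothetical placeholder; the
# description does not fix a concrete detection API.
from PIL import Image

def detect_target_part(image: Image.Image, part: str = "face"):
    """Return the target part's image frame as (left, top, right, bottom),
    or None if the part is not found.

    A real system would call a detection model such as YOLOv3 here,
    e.g. run the detector and keep the highest-confidence box whose
    class matches `part`. This stub keeps the sketch self-contained.
    """
    # results = detector(image); filter detections by class == part ...
    return None

def frame_area(box) -> int:
    """Area of an image frame, used for the first ratio in step S103."""
    left, top, right, bottom = box
    return max(0, right - left) * max(0, bottom - top)
```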
S103, determining the sample expansion number based on a first ratio of the area of the image frame to the area of the target sample image.
In implementation, after the image frame corresponding to the target part is obtained, the sample expansion number can be determined according to a first ratio of the area of the image frame to the area of the target class sample image. Optionally, the smaller the first ratio, the less likely the details of the target part are to be captured, so the first ratio may be inversely proportional to the sample expansion number; that is, the smaller the first ratio of the area of the image frame to the area of the target class sample image, the larger the corresponding sample expansion number.
In some embodiments, determining the sample expansion number based on a first ratio of the area of the image frame to the area of the target class sample image specifically includes:
acquiring preset expansion parameters;
and calculating a second ratio of the preset expansion parameter to the first ratio, and rounding the second ratio to determine the sample expansion quantity.
In practice, the sample expansion number can be determined by the following formula: N = ⌊θ / r⌋, where N represents the sample expansion number, r represents the first ratio of the area of the image frame to the area of the target class sample image, θ represents a preset expansion parameter, and ⌊·⌋ represents the rounding operation. Optionally, the preset expansion parameter may be set as needed without limitation; for example, it may be set to 1 or 2.
In some embodiments, a difference between the preset expansion parameter and the inverse of the first ratio may also be directly calculated, and then the number of sample expansion may be determined by the difference. For example, the preset expansion parameter is set to 20, and if a certain first ratio is 1/10, the corresponding difference is 20-10=10, and the number of sample expansion at this time is 10.
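A minimal sketch of both counting rules follows; the default θ values are taken from the examples above, and the clamping to non-negative counts is an added assumption:

```python
import math

def sample_expansion_count(frame_area: float, image_area: float,
                           theta: float = 2.0) -> int:
    """N = floor(theta / r), where r is the first ratio
    (image frame area / target class sample image area)."""
    r = frame_area / image_area
    return max(0, math.floor(theta / r))

def sample_expansion_count_diff(frame_area: float, image_area: float,
                                theta: float = 20.0) -> int:
    """Alternative rule: N = theta - 1/r, the difference between the
    preset expansion parameter and the inverse of the first ratio."""
    r = frame_area / image_area
    return max(0, round(theta - 1.0 / r))

# Worked example from the text: r = 1/10 and theta = 20 give N = 10.
assert sample_expansion_count_diff(1.0, 10.0) == 10
```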
S104, generating an expanded sample image based on the image frame, and copying the expanded sample image based on the sample expansion quantity.
In implementation, after the sample expansion number is determined, an expanded sample image is generated from the image frame, and the expanded sample image is then copied according to the sample expansion number; that is, as many copies of the expanded sample image are made as the sample expansion number. Alternatively, when generating the expanded sample image, the image frame may be directly cropped out of the target class sample image and used as the expanded sample image.
To ensure the integrity of the target part in the extended sample image, in some embodiments, generating the extended sample image based on the image frame specifically includes:
in the target sample image, expanding the image frame outwards by a preset proportion;
And cutting the target sample image based on the expanded image frame to generate the expanded sample image.
In a specific implementation, considering that the identified image frame may not completely contain the target part, the image frame may be expanded outwards so that the expanded sample image includes the whole target part. Optionally, the image frame is expanded outwards by a preset proportion, which may be set as required; for example, the preset proportion may be set to any value between 10% and 20%. After the image frame is expanded outwards, the target class sample image can be cropped according to the expanded image frame to generate the expanded sample image.
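The sketch below implements this expand-then-crop step with PIL; the 15% default and the clamping to the image borders are illustrative assumptions:

```python
from PIL import Image

def crop_expanded_sample(image: Image.Image, box,
                         expand_ratio: float = 0.15) -> Image.Image:
    """Expand the detected image frame outward by `expand_ratio`
    (10%-20% per the description) and crop the expanded frame from the
    target class sample image, clamped to the image borders."""
    left, top, right, bottom = box
    dw = (right - left) * expand_ratio
    dh = (bottom - top) * expand_ratio
    left = max(0, int(left - dw))
    top = max(0, int(top - dh))
    right = min(image.width, int(right + dw))
    bottom = min(image.height, int(bottom + dh))
    return image.crop((left, top, right, bottom))

# Usage: expanded = crop_expanded_sample(sample_image, (120, 40, 220, 160))
```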
S105, expanding the copied expanded sample image into the training sample data set, and training a target image based on the expanded training sample data set to generate a model.
In implementation, after the plurality of copied expanded sample images are obtained, all of them can be added to the training sample data set, and a target image generation model is trained based on the expanded training sample data set. Since the number of sample images corresponding to the target part in the training sample data set is increased, the target image generation model trained in the embodiment of the application can better generate local details of the image. Referring to fig. 3 and 4, fig. 3 shows a character image generated by a model trained directly on the original training sample data set, as in the related art, while fig. 4 shows a character image generated by the target image generation model trained on the training sample data set expanded according to the embodiment of the present application. The target part corresponding to fig. 4 is the person's face, and the face generated in fig. 4 is clearly better than that in fig. 3.
In some embodiments, training a target image generation model based on the extended training sample data set specifically includes:
acquiring a first sample image and a first sample text corresponding to the first sample image from the expanded training sample data set; wherein the training sample data set further comprises a first sample text corresponding to the first sample image, the first sample image comprising the augmented sample image and the target class sample image;
inputting the first sample text and the first sample image into a reference image generation model corresponding to the target image generation model, and acquiring a predicted image output by the reference image generation model;
calculating a loss function over the pixel color values of the predicted image and the first sample image.
In a specific implementation, in the embodiment of the present application, the process of training the target image generating model by using the extended training sample data set is mainly a process of fine tuning training on the diffusion model, that is, training the reference image generating model by using the sample image of the specified type, so that the reference image generating model can generate an image of the specified type. Optionally, the reference image generating model may select any image generating model in the related art according to needs, and optionally, the reference image generating model may be any diffusion model, for example, a Stable diffusion model.
It should be noted that diffusion models are a recent breakthrough in deep generative models. The basic idea of a diffusion model is to systematically perturb the distribution of the data through a forward diffusion process and then recover the distribution by learning a backward diffusion process, yielding a highly flexible and tractable generative model. Diffusion models can be applied to text-to-image generation tasks and have excellent image generation quality. Furthermore, while existing mainstream text-to-image models are capable of generating high-quality and diverse images according to a given text prompt, these models still lack the ability to mimic or reproduce concepts such as object appearance or style from a particular reference set, and also have difficulty making a designated object appear "perfectly" in the scene to be generated. These models therefore need fine-tuning; fine-tuning training allows them to better understand the appearance or style of objects in the images, and thereby generate satisfactory images more accurately.
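For background, the standard denoising objective used when fine-tuning such a diffusion model can be written as below; this is the usual DDPM formulation, given here for context rather than stated in the description:

```latex
% Standard text-conditioned denoising objective (background material):
% \epsilon_\theta predicts the noise added to sample x_0 at timestep t,
% conditioned on the embedding c of the sample text.
\mathcal{L} = \mathbb{E}_{x_0,\; c,\; \epsilon \sim \mathcal{N}(0, I),\; t}
  \left[ \left\lVert \epsilon - \epsilon_\theta\!\left(
    \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon,\; t,\; c
  \right) \right\rVert_2^2 \right]
```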
To further enhance the model's control over local detail, in some embodiments, calculating the loss over the pixel color values of the predicted image and the first sample image specifically includes:
Determining the area of the target part in the target sample image, and determining a third ratio of the area of the region to the area of the target sample image;
in response to the third ratio being smaller than a preset threshold, increasing the loss weight of the pixel points corresponding to the target part based on the third ratio.
In a specific implementation, considering the importance of local details of an image, this embodiment proposes a local loss weighting method for complex images: feature enhancement is performed on the content of a designated region (the target part) through loss weighting, so as to ensure that its features can be fully represented. When calculating the image reconstruction loss during training, the loss weight of the target part that the model needs to focus on is increased, so that the model concentrates on the target part whose details need to be perfected, improving its ability to handle local detail. Specifically, a key-part segmentation detection model is introduced to generate an image mask that segments out the target part content to be focused on; when the area ratio of the mask corresponding to the target part is smaller than a preset threshold, the loss weight of the mask region corresponding to the target part is increased by an adaptive multiple.
Referring to fig. 5 and 6, the difference between them is that in fig. 6 the loss weight of the pixel points corresponding to the target part (the face) is increased; as a result, the face in fig. 6 is clearer, richer in detail, and of better quality than the person generated in fig. 5.
It should be noted that the preset threshold may be set as required without limitation; for example, it may be set to 1/10. When increasing the loss weight of the pixel points corresponding to the target part according to the third ratio, the specific rule can be chosen as needed; in general, the smaller the third ratio, the larger the increased loss weight. For example, the inverse of the third ratio can be used directly as the multiple applied to the loss weight of the pixel points corresponding to the target part.
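A minimal sketch of this weighting rule is given below, assuming the reconstruction loss is a pixel-wise MSE and taking the inverse of the third ratio as the weight multiple, one of the options the text allows:

```python
import torch

def weighted_reconstruction_loss(pred: torch.Tensor,
                                 target: torch.Tensor,
                                 part_mask: torch.Tensor,
                                 third_ratio: float,
                                 threshold: float = 0.1) -> torch.Tensor:
    """Pixel-wise MSE in which the target-part region (part_mask == 1)
    is up-weighted to 1/third_ratio when the part's area ratio falls
    below the preset threshold (1/10 in the example above)."""
    weights = torch.ones_like(target)
    if third_ratio < threshold:
        # inside the mask the weight becomes 1/third_ratio; outside it stays 1
        weights = weights + part_mask * (1.0 / third_ratio - 1.0)
    return (weights * (pred - target) ** 2).mean()
```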
To further ensure the integrity of the mask corresponding to the target part, in some embodiments, determining the area of the target part in the target sample image specifically includes:
extracting a target mask corresponding to the target part;
performing expansion processing on the target mask to obtain the expanded target mask;
and determining the area of the target local area in the target class sample image based on the expanded target mask.
In a specific implementation, the target mask corresponding to the target part is extracted first; optionally, target part detection and segmentation are performed on the target class sample image through a target part detection segmentation model to obtain the target mask. Meanwhile, to ensure the integrity of the target part's edges, the target mask needs to be dilated after it is acquired, expanding its extraction range. Optionally, when dilating the target mask (in which pixels of the region corresponding to the target part have value 1 and all other pixels have value 0), a convolution-style scan may be performed with a specified kernel of size k x k (k a positive integer): for each scanned pixel, the kernel elements are combined with the underlying mask elements, and the output pixel is 0 only if all covered mask elements are 0, and 1 otherwise. After the target mask is dilated, the area of the target part in the target class sample image can be determined from the dilated mask; optionally, the position region of the target part in the target class sample image is re-determined from the dilated mask, and then the area of that region is determined.
It should be noted that, in some embodiments, after extracting the target mask corresponding to the target local, the area of the target local in the target sample image may be determined directly through the target mask.
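The sketch below performs this dilation with scipy's binary dilation, which matches the any-overlap rule described above; the kernel size k is an illustrative assumption:

```python
import numpy as np
from scipy import ndimage

def dilate_target_mask(mask: np.ndarray, k: int = 5) -> np.ndarray:
    """Dilate a binary target mask (1 inside the target part, 0 elsewhere)
    with a k x k structuring element: an output pixel is 1 if any mask
    pixel under the kernel is 1, and 0 only if all of them are 0."""
    structure = np.ones((k, k), dtype=bool)
    return ndimage.binary_dilation(mask.astype(bool), structure=structure)

# Area of the target part and the third ratio, computed from the mask:
# dilated = dilate_target_mask(mask)
# third_ratio = dilated.sum() / dilated.size
```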
To prevent the image generation model from overfitting, in some embodiments, the plurality of target class sample images includes a first target class sample image obtained directly and a second target class sample image generated by the reference image generation model from a second sample text; the first target class sample image corresponds to a first sample text, and the second sample text is generated based on the first sample text. It should be noted that, if the image generation model is trained only on a specified sample image, it may end up generating only images identical to that sample; for example, if the model is trained only on photographs of a Shar Pei dog, it may generate only Shar Pei images whenever dog-related text is input, and be unable to generate other breeds of dogs. To avoid such overfitting, further dog-related texts, such as "a dog", "a Teddy dog", etc., may be generated when constructing the training sample data set; these texts are input into the reference image generation model, which generates a plurality of dog-related images, and those images are used as sample images in the training sample data set. In this way, training can make the image generation model able to generate a specific breed such as the Shar Pei while still being able to generate other breeds of dogs.
In the related art, training samples must be uniformly cropped during model training, so many edge details are lost. For an image generation model, however, the generated content should be complete, without missing details or unnatural cropping. To ensure the integrity of the edges of images generated by the image generation model, in some embodiments, training a target image generation model based on the extended training sample data set specifically includes:
acquiring the first sample image from the expanded training sample dataset;
dividing the expanded training sample data set into a plurality of sub-sample data sets based on the size of the first sample image; wherein the sample images in each sub-sample data set are the same size and the sample images in different sub-sample data sets are different sizes;
training the target image generation model in batches based on each sub-sample data set; wherein each training batch of the target image generation model comprises: inputting the first sample images corresponding to the sub-sample data set into a plurality of threads running in parallel, each thread performing training of the target image generation model.
In practice, for image data of non-uniform size, a diffusion model could in principle be trained directly on variable-size images, i.e. a single variable-size sample per batch. However, this approach is cumbersome and very slow, and training easily becomes unstable because small-batch training lacks effective regularization. To ensure that the data set meets the model training requirements and that training is stable, the related art therefore generally crops all images to a fixed size (for example, 512 x 512) and trains uniformly. Such uniform cropping obviously causes the images to lose much of their edge detail. Accordingly, in this embodiment, the sample images are grouped by size ratio and the target image generation model is trained in batches, so that all sample images within a batch have the same size while the image sizes of different training batches differ; the model thereby trains on sample images of variable size, and the edge integrity of sample images of different sizes is preserved. Referring to fig. 7 and 8, fig. 7 is a schematic diagram of images generated when the training sample data sets are not grouped, i.e. all sample images are uniformly cropped; it can be seen that the generated images in fig. 7 are clearly missing edge details, such as the person's hands or feet. Fig. 8 is a schematic diagram of images generated after grouping the sample data sets; because the training process corresponding to fig. 8 divides the training sample data set into a plurality of sub-sample data sets, the sample images retain a variety of sizes, and the images generated in fig. 8 contain more character edge details (hands, feet, etc.) than those in fig. 7.
It should be noted that training of image models is generally performed on a GPU (graphics processing unit). On a GPU, sample images of the same size can be processed in parallel, i.e. multiple images can be handled simultaneously by parallel threads, whereas images of different sizes can only be processed separately. In the embodiment of the application, since the training sample data are grouped by size, edge details of sample images of multiple sizes are preserved; at the same time, each sub-sample data set contains multiple sample images, so all sample images in the same sub-sample data set can be trained on the GPU simultaneously, improving efficiency and enabling the model to learn features common to multiple sample images.
It should be noted that, for convenience of distinction, a sample image in the extended training sample data set is referred to as a first sample image; that is, the first sample images include the expanded sample images and the target class sample images. When grouping the first sample images in the expanded training sample data set, the set can be divided into a plurality of sub-sample data sets as needed. Alternatively, the sizes of a number of common sample images may be determined from historical data, each such size corresponding to one sub-sample data set, and each first sample image assigned in turn to one of the sub-sample data sets according to its size. Since it cannot be guaranteed that every first sample image finds a sub-sample data set of exactly the same size, each first sample image is matched to the sub-sample data set with the closest size, and uniform cropping is then performed within each sub-sample data set so that the sample images in it have the same size. Although cropping of sample images occurs here again, multiple sizes are already in use, so the effect of this cropping on the sample images is small.
In some embodiments, the expanded training sample data set is divided into a plurality of sub-sample data sets based on the size of the first sample image, specifically comprising:
determining a plurality of groups of preset sizes corresponding to the plurality of sub-sample data sets, and determining a first aspect ratio of the preset sizes;
determining a second aspect ratio corresponding to each first sample image based on the size of the first sample image;
determining a target sub-sample data set corresponding to each first sample image; wherein the aspect ratio of the target sub-sample dataset is minimally different from the second aspect ratio of its corresponding first sample image;
and storing each first sample image into the corresponding target sub-sample data set, and performing scaling and clipping processing on the first sample image so that the size of the first sample image is the same as the preset size of the corresponding target sub-sample data set.
In a specific implementation, when dividing the expanded training sample data set into a plurality of sub-sample data sets, a plurality of groups of preset sizes corresponding to the sub-sample data sets are determined; each group of preset sizes consists of a width and a height, and each group corresponds to one sub-sample data set. Alternatively, the groups of preset sizes may be set as needed without limitation; for example, the sizes of common sample images may be found from historical data and the preset sizes determined accordingly. Alternatively, in order to distribute sample images of various sizes evenly across the sub-sample data sets, the intervals between adjacent groups of preset sizes may be made equal. Alternatively, in some embodiments, the groups of preset sizes are enumerated as follows: choose an initial width, for example 256, and a maximum for both width and height, for example 1024. While the width is at most 1024, find the current maximum height such that the height is at most 1024 and width times height is at most 512 x 768; add the (width, current maximum height) pair to the list as one size group; then increase the width by 64 and repeat. When the width reaches the maximum value, swap width and height and repeat the same procedure. Finally, remove duplicate size groups from the list and add an extra group with width and height 512, yielding the plurality of groups of preset sizes.
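The sketch below enumerates these size groups; snapping the height down to the 64-pixel grid is an assumption the text does not state explicitly, but every other constant comes from the description above:

```python
def make_preset_sizes(start: int = 256, max_side: int = 1024,
                      step: int = 64, max_area: int = 512 * 768):
    """Enumerate (width, height) preset size groups: for each width from
    `start` to `max_side` in steps of `step`, take the largest height
    with height <= max_side and width*height <= max_area; then repeat
    with width and height swapped, de-duplicate, and add 512 x 512."""
    sizes = set()
    for w in range(start, max_side + 1, step):
        h = min(max_side, max_area // w // step * step)  # snap to the grid
        sizes.add((w, h))
        sizes.add((h, w))  # the swapped width/height pass
    sizes.add((512, 512))
    return sorted(sizes)

# e.g. make_preset_sizes() yields groups such as (256, 1024), (512, 768),
# (768, 512) and (1024, 384), all with width * height <= 512 * 768.
```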
After the groups of preset sizes are determined, a first aspect ratio is determined for each group; a second aspect ratio is then determined for each first sample image according to its size, and the first sample images in the expanded training sample data set are assigned to the target sub-sample data sets corresponding to the preset size groups according to the first and second aspect ratios. In a specific assignment, for each first sample image in the expanded training sample data set, its aspect ratio is calculated, the sub-sample data set whose preset size has the smallest absolute aspect-ratio difference is selected, and the first sample image is added to that sub-sample data set. Alternatively, in order to ensure the training quality of the first sample images, sample images with too large or too small an aspect ratio may be screened out when grouping: a first sample image is excluded if its second aspect ratio is smaller than a specified threshold α or larger than a specified threshold β (α < β).
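A compact sketch of this assignment rule follows, with illustrative α and β values, since the description leaves the thresholds open:

```python
def assign_to_preset_size(width: int, height: int, preset_sizes,
                          alpha: float = 0.5, beta: float = 2.0):
    """Return the (w, h) preset size group whose first aspect ratio is
    closest to this image's second aspect ratio, or None if the image
    is screened out because its aspect ratio is outside (alpha, beta)."""
    ratio = width / height
    if ratio < alpha or ratio > beta:
        return None  # excluded from the training sample data set
    return min(preset_sizes, key=lambda wh: abs(wh[0] / wh[1] - ratio))
```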
After each first sample image is grouped into its corresponding target sub-sample data set, scaling and cropping may be performed so that all sample images in the same sub-sample data set have the same size. The specific cropping and scaling method is not limited, provided the scaled and cropped first sample image has the same preset size as its target sub-sample data set. Optionally, to minimize cropping of the first sample image, in some embodiments the long-to-short side ratio of the first sample image is first compared with that of the corresponding target sub-sample data set. If the first sample image's long-to-short side ratio is larger, its original ratio is kept, its short side is scaled to the short-side size of the target sub-sample data set, and it is then randomly cropped to the preset size of the target sub-sample data set. If its long-to-short side ratio is smaller, its original ratio is kept, its long side is scaled to the long-side size of the target sub-sample data set, and it is then randomly cropped to the width-height size of the corresponding group. If the two ratios are equal, the picture is directly scaled to the preset size of the target sub-sample data set. The long-to-short side ratio is the ratio of the long side to the short side.
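The three cases above are equivalent to scaling so the preset size is fully covered and then random-cropping the overflow, which is what the sketch below does; the LANCZOS filter choice is an assumption:

```python
import random
from PIL import Image

def fit_to_preset_size(image: Image.Image, preset_w: int,
                       preset_h: int) -> Image.Image:
    """Scale the image, keeping its original aspect ratio, until the
    preset size is fully covered, then random-crop to (preset_w,
    preset_h). This reproduces the three long/short-side cases above
    in one covering-scale step."""
    scale = max(preset_w / image.width, preset_h / image.height)
    new_w = max(preset_w, round(image.width * scale))
    new_h = max(preset_h, round(image.height * scale))
    image = image.resize((new_w, new_h), Image.LANCZOS)
    left = random.randint(0, new_w - preset_w)
    top = random.randint(0, new_h - preset_h)
    return image.crop((left, top, left + preset_w, top + preset_h))
```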
In some embodiments, after scaling and cropping the first sample image, the method further comprises:
determining a total number ratio of the first sample image to a second sample image to be generated;
for each of the sub-sample data sets, obtaining a first number of the first sample images in the sub-sample data set, determining a second number of the second sample images based on the total number ratio and the first number, generating the second number of the second sample images by a reference image generation model corresponding to the target image generation model; and adding the second sample image to the sub-sample dataset.
In particular, in order to prevent the trained image generation model from overfitting, a number of second sample images need to be added to each sub-sample data set, where a second sample image is a sample image generated by the reference image generation model corresponding to the target image generation model. Alternatively, the number of second sample images to be added to each sub-sample data set may be determined according to the total number ratio of the first sample images to the second sample images to be generated. The total number ratio may be set as needed without limitation. For example, if the total number ratio is 1:5 and the first number of first sample images in a sub-sample data set is 10, the corresponding second number of second sample images that the sub-sample data set needs to add is 50.
In some embodiments, a second number of the second sample images are generated by a reference image generation model corresponding to the target image generation model; the method specifically comprises the following steps:
acquiring a first sample text corresponding to the first sample image;
generating the second amount of second sample text based on the first sample text;
inputting the second number of second sample texts into the reference image generation model to generate the second number of second sample images.
In implementation, when generating the second sample images through the reference image generation model, the first sample text corresponding to the first sample image can be acquired first, and the second number of second sample texts then generated from it. Optionally, a second sample text is similar to the first sample text; for example, it may be determined by looking up synonyms in a dictionary, finding text of the same type as the first sample text to serve as the second sample text. After the second number of second sample texts are generated, they can be input into the reference image generation model to generate the second number of second sample images. Since an image generation model can control the size of the generated image as required, the reference image generation model may be controlled to generate second sample images having the same preset size as each sub-sample data set.
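As a hedged sketch of this step using the Hugging Face diffusers library (the checkpoint name, prompts, and output paths are illustrative assumptions; the description only requires some reference image generation model such as Stable Diffusion):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Second sample texts derived from the first sample text (hypothetical examples)
second_sample_texts = ["a photo of a dog", "a photo of a Teddy dog"]

for i, prompt in enumerate(second_sample_texts):
    # height/width are set to the sub-sample data set's preset size so the
    # generated second sample images drop straight into the matching group
    image = pipe(prompt, height=512, width=512).images[0]
    image.save(f"second_sample_{i:05d}.png")
```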
Referring to fig. 9, a flowchart of another training method of an image generation model according to an embodiment of the present application is shown, where the training method includes the following steps:
s301, acquiring a training sample data set; the training sample dataset includes a plurality of target class sample images.
S302, dividing the training sample data set into a plurality of sub-sample data sets based on the size of the target class sample image; wherein the sample images in each sub-sample data set are the same size and the sample images in different sub-sample data sets are different sizes.
And S303, training the target image generation model in batches based on each sub-sample data set.
In particular, in this embodiment the sample data are not expanded after the training sample data set is acquired; the training sample data set is directly divided into a plurality of sub-sample data sets, where the sample images within each sub-sample data set are the same size and the sample images in different sub-sample data sets are of different sizes. The target image generation model is then trained in batches on the sub-sample data sets of different sizes, so that the target class sample images within each training batch are of equal size while different batches train the model on target class sample images of different sizes. The model can thus efficiently process images of different sizes and proportions, which optimizes the edge-local quality of generated images and improves the overall generation quality.
In some embodiments, training the target image generation model per batch comprises: inputting the target class sample image corresponding to the sub-sample data set into a plurality of threads running in parallel; each thread performs training of the target image generation model.
In some embodiments, training the target image generation model per batch comprises: and inputting the target class sample images corresponding to the sub-sample data sets into a target thread, and then training the target class sample images of the sub-sample data sets through the target thread in sequence.
To accurately divide a training sample data set into a plurality of sub-sample data sets, in some embodiments, the training sample data set is divided into a plurality of sub-sample data sets based on the size of the target class sample image, specifically comprising:
determining a plurality of groups of preset sizes corresponding to the plurality of sub-sample data sets, and determining a first aspect ratio of the preset sizes;
determining a second aspect ratio corresponding to the target class sample image based on the size of the target class sample image;
determining a target sub-sample data set corresponding to each target class sample image; wherein the difference between the first aspect ratio of the target sub-sample dataset and the second aspect ratio of the corresponding target class sample image is minimal;
And storing each target class sample image into the corresponding target sub-sample data set, and performing scaling and cutting processing on the target class sample image so that the size of the target class sample image is the same as the preset size of the corresponding target sub-sample data set.
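A compact sketch of the bucket-assignment rule (minimal aspect-ratio difference) might look as follows; the preset sizes in the usage comment are illustrative values.

```python
def assign_to_bucket(image_size, preset_sizes):
    """Return the preset size whose first aspect ratio differs least from the
    second aspect ratio (width / height) of the target class sample image."""
    iw, ih = image_size
    second_ratio = iw / ih
    return min(preset_sizes, key=lambda s: abs(s[0] / s[1] - second_ratio))

# Example: a 600x400 image lands in the (768, 512) bucket, whose aspect
# ratio (1.5) exactly matches the image's:
# assign_to_bucket((600, 400), [(512, 768), (768, 512), (640, 640)])
```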
In a specific implementation, the process of dividing the training sample data set into a plurality of sub-sample data sets in this embodiment is similar to the earlier process of dividing the expanded training sample data set into a plurality of sub-sample data sets; the only difference is that the expanded training sample data set additionally contains the expanded sample images, so the specific process is not repeated here.
In some embodiments, determining a plurality of sets of preset dimensions corresponding to the plurality of sub-sample data sets specifically includes:
determining a preset product of width and height corresponding to the multiple groups of preset sizes;
determining a plurality of first side lengths; each first side length is the width or the height of an image, and the first side lengths form an arithmetic sequence;
for each first side length, determining a second side length corresponding to the first side length based on the first side length, a preset maximum side length and a preset product of the width and the height, and forming a group of preset sizes based on the first side length and the second side length.
In a specific implementation, each group of preset sizes comprises two parameters, a height and a width. For convenience of description, the first side is used to represent either the width or the height; note that when the first side represents the height, the corresponding second side represents the width, and when the first side represents the width, the corresponding second side represents the height. The preset product of width and height corresponding to the groups of preset sizes may be set as required, so that the groups share a uniform area budget; for example, the preset product may be set to 512×768, in which case the width-height product of every group of preset sizes is bounded by 512×768. Specifically, when the groups of preset sizes are determined, one parameter (width or height) is fixed first and expressed as the first side, and the remaining parameter, namely the second side, is then determined from the preset product of width and height and the preset maximum side length. Optionally, the preset maximum side length bounds both the height and the width corresponding to a preset size; for example, when the preset maximum side length is set to 1024, neither the width nor the height of any group of preset sizes may exceed 1024.
On this basis, the width-height product of any group of preset sizes is limited by the preset product of width and height, i.e., the product of the width and the height of each group of preset sizes is smaller than or equal to the preset product. When one parameter (width or height) is known, the other parameter can be determined from these two conditions, so that a group of preset sizes is finally determined.
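Under the constraints just described, the groups of preset sizes can be enumerated mechanically. The sketch below assumes a step of 64 between first side lengths and rounds the second side down to a step multiple; both choices are illustrative rather than prescribed.

```python
def build_preset_sizes(start=256, step=64, max_side=1024, preset_product=512 * 768):
    """Enumerate (first_side, second_side) groups: first sides form an
    arithmetic sequence, and each second side is the largest value with
    first * second <= preset_product and second <= max_side."""
    sizes = []
    first = start
    while first <= max_side:
        second = min(max_side, preset_product // first)
        second -= second % step  # illustrative rounding to a step multiple
        if second >= start:
            sizes.append((first, second))
        first += step
    return sizes

# With the defaults this yields groups such as (256, 1024), (512, 768),
# (768, 512) and (1024, 384), each with width * height <= 512 * 768.
```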
In order to minimize clipping of the original target class sample image, in some embodiments, the target class sample image is scaled and clipped, including:
determining a first long-short side ratio of the target class sample image and a second long-short side ratio of a target sub-sample data set corresponding to the target class sample image;
in response to the first long-to-short side ratio being greater than the second long-to-short side ratio, maintaining the first long-to-short side ratio, scaling the short side of the target class sample image to be equal to the short side of the corresponding target sub-sample data set, and cropping along the long side of the target class sample image so that the long side of the target class sample image is equal to the long side of the corresponding target sub-sample data set;
in response to the first long-to-short side ratio being less than the second long-to-short side ratio, maintaining the first long-to-short side ratio, scaling the long side of the target class sample image to be equal to the long side of the target sub-sample dataset corresponding thereto, and cropping along the short side of the target class sample image so that the short side of the target class sample image is equal to the short side of the target sub-sample dataset corresponding thereto;
And in response to the first long-short side ratio being equal to the second long-short side ratio, scaling the size of the target class sample image to be equal to the preset size of the corresponding target sub-sample data set.
In a specific implementation, for convenience of calculation, the height and width of a sample image may be summarized as a long side and a short side, where the long side is the longer of the height and the width and the short side is the shorter of the two, i.e., the long side is greater than the short side; optionally, when the height and width are equal, the long and short sides may be assigned arbitrarily. The long-short side ratio is the ratio of the long side to the short side. Each target sub-sample data set corresponds to a preset size comprising two parameters, a width and a height, and therefore also corresponds to a long side and a short side. For ease of distinction in this embodiment, the first long-short side ratio denotes the long-short side ratio of the target class sample image, and the second long-short side ratio denotes the long-short side ratio of the target sub-sample data set. In the cropping and scaling process, the two ratios are compared. If the long-short side ratio of the target class sample image is greater than that of the target sub-sample data set, the image is scaled, keeping its original long-short side ratio, until its short side equals the short side of the corresponding target sub-sample data set, and the image is then cropped, randomly or according to a certain rule, to the preset size of that data set. If the long-short side ratio of the target class sample image is smaller than that of the target sub-sample data set, the image is first scaled, keeping its original long-short side ratio, until its long side equals the long side of the corresponding target sub-sample data set, and is then randomly cropped to the width and height of the corresponding group. If the two ratios are equal, the target class sample image is directly scaled to the preset size of the corresponding target sub-sample data set.
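By way of illustration only, the three branches above can be realized by a short scale-then-crop routine. The sketch below uses center cropping (random cropping is equally admissible) and assumes, consistent with the minimal-aspect-ratio bucket assignment, that the image and its target sub-sample data set share the same orientation.

```python
from PIL import Image

def scale_and_crop(img: Image.Image, preset: tuple) -> Image.Image:
    """Scale the image while keeping its long/short side ratio, then crop it
    to the preset (width, height). Assumes image and preset orientations match."""
    pw, ph = preset
    iw, ih = img.size
    first_ratio = max(iw, ih) / min(iw, ih)    # long-short side ratio of image
    second_ratio = max(pw, ph) / min(pw, ph)   # long-short side ratio of preset
    if first_ratio >= second_ratio:
        scale = min(pw, ph) / min(iw, ih)      # match short sides, crop long side
    else:
        scale = max(pw, ph) / max(iw, ih)      # match long sides, crop short side
    img = img.resize((round(iw * scale), round(ih * scale)))
    left = (img.width - pw) // 2               # center crop (one admissible rule)
    top = (img.height - ph) // 2
    return img.crop((left, top, left + pw, top + ph))
```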
In some embodiments, after determining the second aspect ratio corresponding to the target class sample image based on the size of the target class sample image, the method further comprises:
deleting the current target class sample image in response to the second aspect ratio corresponding to the current target class sample image not being within the preset aspect ratio range.
In a specific implementation, in order to ensure the quality of the target class sample images, sample images with an excessively large or small aspect ratio may be screened out first when the training sample data set is grouped: if a target class sample image does not fall within the preset aspect ratio range, that is, its second aspect ratio is smaller than a specified threshold α or greater than a specified threshold β (α < β), the target class sample image is excluded.
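A one-line filter expresses this screening; the threshold values below are illustrative, the embodiment only requiring α < β.

```python
def filter_by_aspect_ratio(images, alpha=0.5, beta=2.0):
    """Keep only images whose width/height ratio lies within [alpha, beta]."""
    return [im for im in images if alpha <= im.width / im.height <= beta]
```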
According to the training method of the image generation model, a training sample data set is obtained; the training sample dataset comprises a plurality of target class sample images; performing target local recognition on the target sample image to obtain an image frame corresponding to the target local; determining a sample expansion number based on a first ratio of the area of the image frame to the area of the target class sample image; generating an expanded sample image based on the image frame and copying the expanded sample image based on the expanded number of samples; the copied expanded sample image is expanded into the training sample data set, and a target image generation model is trained based on the expanded training sample data set, so that the capability of the image generation model for processing local details is improved through reasonable expansion of image local sample data, and the problem of poor generation effect caused by small occupation of key parts in the image can be effectively solved.
It should be noted that, the method of the embodiments of the present application may be performed by a single device, for example, a computer or a server. The method of the embodiment can also be applied to a distributed scene, and is completed by mutually matching a plurality of devices. In the case of such a distributed scenario, one of the devices may perform only one or more steps of the methods of embodiments of the present application, and the devices may interact with each other to complete the methods.
It should be noted that some embodiments of the present application are described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Exemplary apparatus
Based on the same inventive concept, the application also provides a training device of the image generation model, which corresponds to the method of any embodiment.
Referring to fig. 10, the training apparatus of the image generation model includes:
a first acquisition module 201 that acquires a training sample dataset; the training sample dataset comprises a plurality of target class sample images;
the identification module 202 is used for carrying out target local identification on the target sample image to obtain an image frame corresponding to the target local;
a determining module 203, configured to determine a sample expansion number based on a first ratio of an area of the image frame to an area of the target class sample image;
a generation module 204 that generates an expanded sample image based on the image frame and copies the expanded sample image based on the sample expansion number;
the first training module 205 expands the copied expanded sample image into the training sample data set, and trains a target image generation model based on the expanded training sample data set.
In some embodiments, the generating module 204 is specifically configured to:
in the target sample image, expanding the image frame outwards by a preset proportion;
and cutting the target sample image based on the expanded image frame to generate the expanded sample image.
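An illustrative sketch of this expand-and-crop operation follows, using PIL and an assumed preset proportion of 10% per side.

```python
from PIL import Image

def expand_and_crop(img: Image.Image, box, expand_ratio=0.1) -> Image.Image:
    """Expand the image frame outward by a preset proportion on each side,
    clamp it to the image bounds, and crop out the expanded sample image."""
    x0, y0, x1, y1 = box
    dw = (x1 - x0) * expand_ratio
    dh = (y1 - y0) * expand_ratio
    x0, y0 = max(0, int(x0 - dw)), max(0, int(y0 - dh))
    x1, y1 = min(img.width, int(x1 + dw)), min(img.height, int(y1 + dh))
    return img.crop((x0, y0, x1, y1))
```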
In some embodiments, the determining module 203 is specifically configured to:
Acquiring preset expansion parameters;
and calculating a second ratio of the preset expansion parameter to the first ratio, and rounding the second ratio to determine the sample expansion quantity.
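Numerically, the rule amounts to the following; the preset expansion parameter of 0.5 is an assumed value for illustration.

```python
def sample_expansion_number(frame_area, image_area, preset_param=0.5):
    """Second ratio = preset expansion parameter / first ratio, rounded.
    The smaller the frame's share of the image, the more copies are made."""
    first_ratio = frame_area / image_area
    return round(preset_param / first_ratio)

# Example: a frame covering 5% of the image yields round(0.5 / 0.05) = 10 copies.
```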
In some embodiments, the first training module 205 is specifically configured to:
acquiring a first sample image and a first sample text corresponding to the first sample image from the expanded training sample data set; wherein the training sample data set further comprises a first sample text corresponding to the first sample image, the first sample image comprising the augmented sample image and the target class sample image;
inputting the first sample text and the first sample image into a reference image generation model corresponding to the target image generation model, and acquiring a predicted image output by the reference image generation model;
a loss over the pixel point color values of the predicted image and the first sample image is calculated.
In some embodiments, the first training module 205 is specifically configured to:
determining the area of the region where the target part is located in the target class sample image, and determining a third ratio of that region area to the area of the target class sample image;
And responding to the third ratio value being smaller than a preset threshold value, and increasing the loss weight of the pixel point corresponding to the target part based on the third ratio value.
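One way to realize this weighted pixel loss is sketched below in PyTorch; the inverse-ratio weighting rule and the 5% threshold are illustrative assumptions, not prescribed by the embodiment.

```python
import torch

def weighted_pixel_loss(pred, target, part_mask, threshold=0.05):
    """MSE over pixel color values; pixels of the target part receive a
    larger loss weight when the part's area share (the third ratio) is small.
    `part_mask` is an (H, W) 0/1 tensor; pred and target are (C, H, W)."""
    third_ratio = part_mask.float().mean()        # region area / image area
    weight = torch.ones_like(part_mask, dtype=pred.dtype)
    if third_ratio < threshold:
        # Raise the weight inside the part to 1 / third_ratio (an assumption).
        weight = weight + part_mask.to(pred.dtype) * (1.0 / third_ratio - 1.0)
    return (weight * (pred - target) ** 2).mean()
```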
In some embodiments, the first training module 205 is specifically configured to:
extracting a target mask corresponding to the target part;
performing expansion processing on the target mask to obtain the expanded target mask;
and determining the area of the target local area in the target class sample image based on the expanded target mask.
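A sketch of this mask dilation using OpenCV follows; the 7×7 kernel is an assumed choice.

```python
import cv2
import numpy as np

def dilated_part_area(mask: np.ndarray, kernel_size: int = 7) -> int:
    """Dilate the binary target mask, then count its pixels to obtain the
    area of the target local region in the sample image."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    dilated = cv2.dilate(mask.astype(np.uint8), kernel, iterations=1)
    return int(np.count_nonzero(dilated))
```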
In some embodiments, the plurality of target class sample images include a first target class sample image obtained directly and a second target class sample image generated by the reference image generation model from a second sample text; the first target class sample image corresponds to a first sample text, and the second sample text is generated based on the first sample text.
In some embodiments, the first training module 205 is specifically configured to:
acquiring the first sample image from the expanded training sample dataset;
dividing the expanded training sample data set into a plurality of sub-sample data sets based on the size of the first sample image; wherein the sample images in each sub-sample data set are the same size and the sample images in different sub-sample data sets are different sizes;
training the target image generation model in batches based on each sub-sample data set; wherein each training of the target image generation model comprises: inputting the first sample image corresponding to the sub-sample data set into a plurality of threads running in parallel; each thread performs training of the target image generation model.
In some embodiments, the first training module 205 is specifically configured to:
determining a plurality of groups of preset sizes corresponding to the plurality of sub-sample data sets, and determining a first aspect ratio of the preset sizes;
determining a second aspect ratio corresponding to the first sample image based on the size of the first sample image;
determining a target sub-sample data set corresponding to each first sample image; wherein the difference between the first aspect ratio of the target sub-sample data set and the second aspect ratio of its corresponding first sample image is minimal;
and storing each first sample image into the corresponding target sub-sample data set, and performing scaling and clipping processing on the first sample image so that the size of the first sample image is the same as the preset size of the corresponding target sub-sample data set.
In some embodiments, the apparatus further comprises an adding module for:
determining a total number ratio of the first sample image to a second sample image to be generated;
for each of the sub-sample data sets, obtaining a first number of the first sample images in the sub-sample data set, determining a second number of the second sample images based on the total number ratio and the first number, generating the second number of the second sample images by a reference image generation model corresponding to the target image generation model; and adding the second sample image to the sub-sample dataset.
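The per-set bookkeeping reduces to simple arithmetic, sketched here with an assumed 4:1 total number ratio in the example.

```python
def second_number_for_set(first_number: int, total_ratio: float) -> int:
    """Derive, for one sub-sample data set, how many second sample images to
    generate from the total number ratio of first to second sample images."""
    return round(first_number / total_ratio)

# Example: with a 4:1 ratio, a set holding 120 first sample images
# receives second_number_for_set(120, 4.0) == 30 generated images.
```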
In some embodiments, the adding module is specifically configured to:
acquiring a first sample text corresponding to the first sample image;
generating the second number of second sample texts based on the first sample text;
inputting the second number of second sample texts into the reference image generation model to generate the second number of second sample images.
For convenience of description, the above system is described as being functionally divided into various modules, respectively. Of course, the functions of each module may be implemented in the same piece or pieces of software and/or hardware when implementing the present application.
The system of the foregoing embodiment is used to implement the training method of the corresponding image generation model in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which is not described herein.
Based on the same inventive concept, the application also provides another training device of the image generation model, which corresponds to the method of any embodiment.
Referring to fig. 11, the training apparatus of the image generation model includes:
a second acquisition module 401 that acquires a training sample dataset; the training sample dataset comprises a plurality of target class sample images;
a grouping module 402 that divides the expanded training sample data set into a plurality of sub-sample data sets based on the size of the target class sample image; wherein the sample images in each sub-sample data set are the same size and the sample images in different sub-sample data sets are different sizes;
a second training module 403 trains the target image generation model in batches based on the respective sub-sample data sets.
In some embodiments, each training of the target image generation model includes: inputting the target class sample image corresponding to the sub-sample data set into a plurality of threads running in parallel; each thread performs training of the target image generation model.
In some embodiments, the grouping module is specifically configured to:
determining a plurality of groups of preset sizes corresponding to the plurality of sub-sample data sets, and determining a first aspect ratio of the preset sizes;
determining a second aspect ratio corresponding to the target class sample image based on the size of the target class sample image;
determining a target sub-sample data set corresponding to each target class sample image; wherein the difference between the first aspect ratio of the target sub-sample dataset and the second aspect ratio of the corresponding target class sample image is minimal;
and storing each target class sample image into the corresponding target sub-sample data set, and performing scaling and cutting processing on the target class sample image so that the size of the target class sample image is the same as the preset size of the corresponding target sub-sample data set.
In some embodiments, the grouping module is specifically configured to:
determining a preset product of width and height corresponding to the multiple groups of preset sizes;
determining a plurality of first side lengths; each first side length is the width or the height of an image, and the first side lengths form an arithmetic sequence;
For each first side length, determining a second side length corresponding to the first side length based on the first side length, a preset maximum side length and a preset product of the width and the height, and forming a group of preset sizes based on the first side length and the second side length.
In some embodiments, the grouping module is specifically configured to:
determining a first long-short side ratio of the target class sample image and a second long-short side ratio of a target sub-sample data set corresponding to the target class sample image;
in response to the first long-to-short side ratio being greater than the second long-to-short side ratio, maintaining the first long-to-short side ratio, scaling the short side of the target class sample image to be equal to the short side of the corresponding target sub-sample data set, and cropping along the long side of the target class sample image so that the long side of the target class sample image is equal to the long side of the corresponding target sub-sample data set;
in response to the first long-to-short side ratio being less than the second long-to-short side ratio, maintaining the first long-to-short side ratio, scaling the long side of the target class sample image to be equal to the long side of the target sub-sample dataset corresponding thereto, and cropping along the short side of the target class sample image so that the short side of the target class sample image is equal to the short side of the target sub-sample dataset corresponding thereto;
And in response to the first long-short side ratio being equal to the second long-short side ratio, scaling the size of the target class sample image to be equal to the preset size of the corresponding target sub-sample data set.
In some embodiments, the grouping module is specifically further configured to:
and deleting the current target class sample image in response to the second aspect ratio corresponding to the current target class sample image not being within the preset aspect ratio range.
For convenience of description, the above system is described as being functionally divided into various modules, respectively. Of course, the functions of each module may be implemented in the same piece or pieces of software and/or hardware when implementing the present application.
The system of the foregoing embodiment is used to implement the training method of the corresponding image generation model in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which is not described herein.
Based on the same inventive concept, the application also provides an electronic device corresponding to the method of any embodiment, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the training method of the image generation model of any embodiment when executing the program.
Fig. 12 is a schematic diagram showing a hardware structure of a more specific electronic device according to the present embodiment, where the device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 implement communication connections therebetween within the device via a bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), microprocessor, application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, etc. for executing relevant programs to implement the technical solutions provided in the embodiments of the present disclosure.
The memory 1020 may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), static storage device, dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the embodiments of the present specification are implemented in software or firmware, the associated program code is stored in the memory 1020 and executed by the processor 1010.
The input/output interface 1030 is used to connect with an input/output module for inputting and outputting information. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide corresponding functionality. Wherein the input devices may include a keyboard, mouse, touch screen, microphone, various types of sensors, etc., and the output devices may include a display, speaker, vibrator, indicator lights, etc.
Communication interface 1040 is used to connect communication modules (not shown) to enable communication interactions of the present device with other devices. The communication module may implement communication through a wired manner (such as USB, network cable, etc.), or may implement communication through a wireless manner (such as mobile network, WIFI, bluetooth, etc.).
Bus 1050 includes a path for transferring information between components of the device (e.g., processor 1010, memory 1020, input/output interface 1030, and communication interface 1040).
It should be noted that although the above-described device only shows processor 1010, memory 1020, input/output interface 1030, communication interface 1040, and bus 1050, in an implementation, the device may include other components necessary to achieve proper operation. Furthermore, it will be understood by those skilled in the art that the above-described apparatus may include only the components necessary to implement the embodiments of the present description, and not all the components shown in the drawings.
The electronic device of the foregoing embodiment is configured to implement the training method of the corresponding image generation model in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which is not described herein.
The memory 1020 stores machine readable instructions executable by the processor 1010. When the electronic device is running, the processor 1010 and the memory 1020 communicate over the bus 1050, such that the processor 1010 performs the following instructions when running: acquiring a training sample data set; the training sample dataset comprises a plurality of target class sample images; performing target local recognition on the target class sample image to obtain an image frame corresponding to the target local; determining a sample expansion number based on a first ratio of the area of the image frame to the area of the target class sample image; generating an expanded sample image based on the image frame and copying the expanded sample image based on the sample expansion number; and expanding the copied expanded sample image into the training sample data set, and training a target image generation model based on the expanded training sample data set.
In one possible implementation, the instructions executed by the processor 1010 generate an extended sample image based on the image frame, specifically including:
In the target sample image, expanding the image frame outwards by a preset proportion;
and cutting the target sample image based on the expanded image frame to generate the expanded sample image.
In one possible implementation, the determining the number of sample extensions based on the first ratio of the area of the image frame to the area of the target class sample image by the processor 1010 specifically includes:
acquiring preset expansion parameters;
and calculating a second ratio of the preset expansion parameter to the first ratio, and rounding the second ratio to determine the sample expansion quantity.
In a possible implementation manner, the instructions executed by the processor 1010 train the target image generation model based on the expanded training sample data set, specifically including:
acquiring a first sample image and a first sample text corresponding to the first sample image from the expanded training sample data set; wherein the training sample data set further comprises a first sample text corresponding to the first sample image, the first sample image comprising the augmented sample image and the target class sample image;
Inputting the first sample text and the first sample image into a reference image generation model corresponding to the target image generation model, and acquiring a predicted image output by the reference image generation model;
a loss over the pixel point color values of the predicted image and the first sample image is calculated.
In a possible implementation manner, calculating the loss over the pixel point color values of the predicted image and the first sample image in the instructions executed by the processor 1010 specifically includes:
determining the area of the region where the target part is located in the target class sample image, and determining a third ratio of that region area to the area of the target class sample image;
and responding to the third ratio value being smaller than a preset threshold value, and increasing the loss weight of the pixel point corresponding to the target part based on the third ratio value.
In a possible implementation manner, the determining, in the instructions executed by the processor 1010, the area of the target local area in the target class sample image specifically includes:
extracting a target mask corresponding to the target part;
performing expansion processing on the target mask to obtain the expanded target mask;
And determining the area of the target local area in the target class sample image based on the expanded target mask.
In one possible implementation, in the instructions executed by the processor 1010, the plurality of target class sample images include a first target class sample image obtained directly and a second target class sample image generated by the reference image generation model from a second sample text; the first target class sample image corresponds to a first sample text, and the second sample text is generated based on the first sample text.
In a possible implementation manner, the instructions executed by the processor 1010 train the target image generation model based on the expanded training sample data set, specifically including:
acquiring the first sample image from the expanded training sample dataset;
dividing the expanded training sample data set into a plurality of sub-sample data sets based on the size of the first sample image; wherein the sample images in each sub-sample data set are the same size and the sample images in different sub-sample data sets are different sizes;
training the target image generation model in batches based on each sub-sample data set; wherein each training of the target image generation model comprises: inputting the first sample image corresponding to the sub-sample data set into a plurality of threads running in parallel; each thread performs training of the target image generation model.
In a possible implementation manner, the instructions executed by the processor 1010 divide the extended training sample data set into a plurality of sub-sample data sets based on the size of the first sample image, specifically include:
determining a plurality of groups of preset sizes corresponding to the plurality of sub-sample data sets, and determining a first aspect ratio of the preset sizes;
determining a second aspect ratio corresponding to the first sample image based on the size of the first sample image;
determining a target sub-sample data set corresponding to each first sample image; wherein the difference between the first aspect ratio of the target sub-sample data set and the second aspect ratio of its corresponding first sample image is minimal;
and storing each first sample image into the corresponding target sub-sample data set, and performing scaling and clipping processing on the first sample image so that the size of the first sample image is the same as the preset size of the corresponding target sub-sample data set.
In one possible implementation, in instructions executed by the processor 1010, after scaling and cropping the first sample image, the method further includes:
Determining a total number ratio of the first sample image to a second sample image to be generated;
for each of the sub-sample data sets, obtaining a first number of the first sample images in the sub-sample data set, determining a second number of the second sample images based on the total number ratio and the first number, generating the second number of the second sample images by a reference image generation model corresponding to the target image generation model; and adding the second sample image to the sub-sample dataset.
In a possible implementation manner, in the instructions executed by the processor 1010, a second number of the second sample images are generated by a reference image generation model corresponding to the target image generation model; the method specifically comprises the following steps:
acquiring a first sample text corresponding to the first sample image;
generating the second number of second sample texts based on the first sample text;
inputting the second number of second sample texts into the reference image generation model to generate the second number of second sample images.
By the mode, when the electronic equipment operates, a training sample data set is obtained; the training sample dataset comprises a plurality of target class sample images; performing target local recognition on the target sample image to obtain an image frame corresponding to the target local; determining a sample expansion number based on a first ratio of the area of the image frame to the area of the target class sample image; generating an expanded sample image based on the image frame and copying the expanded sample image based on the expanded number of samples; the copied expanded sample image is expanded into the training sample data set, and a target image generation model is trained based on the expanded training sample data set, so that the capability of the image generation model for processing local details is improved through reasonable expansion of image local sample data, and the problem of poor generation effect caused by small occupation of key parts in the image can be effectively solved.
Exemplary program product
Based on the same inventive concept, corresponding to any of the above embodiments of the method, the present application further provides a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the training method of the image generation model according to any of the above embodiments.
The computer readable media of the present embodiments, including both permanent and non-permanent, removable and non-removable media, may be used to implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device.
The storage medium of the foregoing embodiments stores computer instructions for causing the computer to execute the training method of the image generation model according to any one of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiments, which are not described herein.
Based on the same inventive concept, the present application also provides a computer program product, corresponding to the method of any of the embodiments described above, comprising a computer program. In some embodiments, the computer program instructions may be executed by one or more processors of a computer to cause the computer and/or the processor to perform the training method of the image generation model described in the above embodiments. Corresponding to the execution subject of each step in each embodiment of the training method of the image generation model, the processor executing the corresponding step may belong to the corresponding execution subject.
The computer program product of the above embodiment is configured to enable the computer and/or the processor to perform the training method of the image generation model according to any one of the above embodiments, and has the beneficial effects of the corresponding method embodiments, which are not described herein.
It can be appreciated that before using the technical solutions of the embodiments in the present application, the user is informed about the type, the use range, the use scenario, etc. of the related personal information in an appropriate manner, and the authorization of the user is obtained.
For example, in response to receiving an active request from a user, a prompt is sent to the user to explicitly inform the user that the requested operation will require acquiring and using the user's personal information. The user can thus choose, according to the prompt information, whether to provide personal information to the software or hardware, such as the electronic device, application program, server or storage medium, that performs the operations of the technical solution.
As an alternative but non-limiting implementation, in response to receiving an active request from the user, the prompt information may be sent to the user by way of, for example, a popup window, in which the prompt information may be presented as text. In addition, the popup window may carry a selection control for the user to choose whether to "agree" or "disagree" to provide personal information to the electronic device.
It will be appreciated that the above-described notification and user authorization acquisition process is merely illustrative, and not limiting of the implementation of the present application, and that other ways of satisfying relevant legal regulations may be applied to the implementation of the present application.
It will be appreciated by those skilled in the art that embodiments of the present application may be implemented as a system, method, or computer program product. Thus, the present application may be embodied in the form of: all hardware, all software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software, is generally referred to herein as a "circuit," module, "or" system. Furthermore, in some embodiments, the present application may also be embodied in the form of a computer program product in one or more computer-readable media, which contain computer-readable program code.
Any combination of one or more computer readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in one or more programming languages, including an object oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer, for example, through the internet using an internet service provider.
It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Where each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit, in accordance with embodiments of the present application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Those of ordinary skill in the art will appreciate that: the discussion of any of the embodiments above is merely exemplary and is not intended to suggest that the scope of the application (including the claims) is limited to these examples; the technical features of the above embodiments or in the different embodiments may also be combined within the idea of the present application, the steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the present application as described above, which are not provided in detail for the sake of brevity.
Additionally, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures, in order to simplify the illustration and discussion, and so as not to obscure the embodiments of the present application. Furthermore, the devices may be shown in block diagram form in order to avoid obscuring the embodiments of the present application, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform on which the embodiments of the present application are to be implemented (i.e., such specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the application, it should be apparent to one skilled in the art that embodiments of the application can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative in nature and not as restrictive.
While the present application has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of those embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the embodiments discussed.
The present embodiments are intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Accordingly, any omissions, modifications, equivalents, improvements and/or the like which are within the spirit and principles of the embodiments are intended to be included within the scope of the present application.

Claims (20)

1. A method of training an image generation model, comprising:
acquiring a training sample data set; the training sample dataset comprises a plurality of target class sample images;
performing target local recognition on the target sample image to obtain an image frame corresponding to the target local;
determining a sample expansion number based on a first ratio of the area of the image frame to the area of the target class sample image;
generating an expanded sample image based on the image frame and copying the expanded sample image based on the expanded number of samples;
and expanding the copied expanded sample image into the training sample data set, and training a target image generation model based on the expanded training sample data set.
2. The method of claim 1, wherein generating an extended sample image based on the image frame, comprises:
in the target sample image, expanding the image frame outwards by a preset proportion;
and cutting the target sample image based on the expanded image frame to generate the expanded sample image.
3. The method of claim 1, wherein determining the number of sample extensions based on a first ratio of the area of the image frame to the area of the object-class sample image, comprises:
acquiring preset expansion parameters;
and calculating a second ratio of the preset expansion parameter to the first ratio, and rounding the second ratio to determine the sample expansion quantity.
4. The method according to claim 1, wherein training a target image generation model based on the augmented training sample dataset, in particular comprises:
acquiring a first sample image and a first sample text corresponding to the first sample image from the expanded training sample data set; wherein the training sample data set further comprises a first sample text corresponding to the first sample image, the first sample image comprising the augmented sample image and the target class sample image;
Inputting the first sample text and the first sample image into a reference image generation model corresponding to the target image generation model, and acquiring a predicted image output by the reference image generation model;
a loss over the pixel point color values of the predicted image and the first sample image is calculated.
5. The method according to claim 4, wherein calculating the loss over the pixel point color values of the predicted image and the first sample image comprises:
determining the area of the region where the target part is located in the target class sample image, and determining a third ratio of that region area to the area of the target class sample image;
and responding to the third ratio value being smaller than a preset threshold value, and increasing the loss weight of the pixel point corresponding to the target part based on the third ratio value.
6. The method according to claim 4, wherein determining the area of the target local area in the target class sample image comprises:
extracting a target mask corresponding to the target part;
performing expansion processing on the target mask to obtain the expanded target mask;
and determining the area of the target local area in the target class sample image based on the expanded target mask.
7. The method of claim 4, wherein the plurality of target class sample images include a first target class sample image obtained directly and a second target class sample image generated by the reference image generation model from a second sample text; the first target class sample image corresponds to a first sample text, and the second sample text is generated based on the first sample text.
8. The method according to claim 4, wherein training a target image generation model based on the augmented training sample dataset, in particular comprises:
acquiring the first sample image from the expanded training sample dataset;
dividing the expanded training sample data set into a plurality of sub-sample data sets based on the size of the first sample image; wherein the sample images in each sub-sample data set are the same size and the sample images in different sub-sample data sets are different sizes;
training the target image generation model in batches based on each sub-sample data set; wherein each training of the target image generation model comprises: inputting the first sample image corresponding to the sub-sample data set into a plurality of threads running in parallel; each thread performs training of the target image generation model.
9. The method according to claim 8, wherein the expanded training sample data set is divided into a plurality of sub-sample data sets based on the size of the first sample image, in particular comprising:
determining a plurality of groups of preset sizes corresponding to the plurality of sub-sample data sets, and determining a first aspect ratio of the preset sizes;
determining a second aspect ratio corresponding to the first sample image based on the size of the first sample image;
determining a target sub-sample data set corresponding to each first sample image; wherein a difference between the first aspect ratio of the target sub-sample dataset and the second aspect ratio of the corresponding first sample image is minimal;
and storing each first sample image into the corresponding target sub-sample data set, and performing scaling and clipping processing on the first sample image so that the size of the first sample image is the same as the preset size of the corresponding target sub-sample data set.
10. The method of claim 9, wherein after scaling and cropping the first sample image, the method further comprises:
Determining a total number ratio of the first sample image to a second sample image to be generated;
for each of the sub-sample data sets, obtaining a first number of the first sample images in the sub-sample data set, determining a second number of the second sample images based on the total number ratio and the first number, generating the second number of the second sample images by a reference image generation model corresponding to the target image generation model; and adding the second sample image to the sub-sample dataset.
11. The method of claim 10, wherein a second number of the second sample images are generated by a reference image generation model corresponding to the target image generation model; the method specifically comprises the following steps:
acquiring a first sample text corresponding to the first sample image;
generating the second number of second sample texts based on the first sample text;
inputting the second number of second sample texts into the reference image generation model to generate the second number of second sample images.
12. A method of training an image generation model, comprising:
Acquiring a training sample data set; the training sample dataset comprises a plurality of target class sample images;
dividing the training sample data set into a plurality of sub-sample data sets based on the size of the target class sample image; wherein the sample images in each sub-sample data set are the same size and the sample images in different sub-sample data sets are different sizes;
and training the target image generation model in batches based on each sub-sample data set.
13. The method according to claim 12, wherein the training sample data set is divided into a plurality of sub-sample data sets based on the size of the target class sample image, in particular comprising:
determining a plurality of groups of preset sizes corresponding to the plurality of sub-sample data sets, and determining a first aspect ratio of the preset sizes;
determining a second aspect ratio corresponding to the target class sample image based on the size of the target class sample image;
determining a target sub-sample data set corresponding to each target class sample image; wherein the difference between the first aspect ratio of the target sub-sample dataset and the second aspect ratio of the corresponding target class sample image is minimal;
And storing each target class sample image into the corresponding target sub-sample data set, and performing scaling and cutting processing on the target class sample image so that the size of the target class sample image is the same as the preset size of the corresponding target sub-sample data set.
14. The method according to claim 13, wherein determining a plurality of sets of preset dimensions corresponding to the plurality of sub-sample data sets, in particular comprises:
determining a preset product of width and height corresponding to the multiple groups of preset sizes;
determining a plurality of first side lengths; each first side length is the width or the height of an image, and the first side lengths form an arithmetic sequence;
for each first side length, determining a second side length corresponding to the first side length based on the first side length, a preset maximum side length and a preset product of the width and the height, and forming a group of preset sizes based on the first side length and the second side length.
15. The method according to claim 13, wherein scaling and cropping the target class sample image comprises:
determining a first long-to-short side ratio of the target class sample image and a second long-to-short side ratio of the target sub-sample data set corresponding to the target class sample image;
in response to the first long-to-short side ratio being greater than the second long-to-short side ratio, maintaining the first long-to-short side ratio, scaling the short side of the target class sample image to be equal to the short side of the corresponding target sub-sample data set, and cropping along the long side of the target class sample image so that the long side of the target class sample image is equal to the long side of the corresponding target sub-sample data set;
in response to the first long-to-short side ratio being less than the second long-to-short side ratio, maintaining the first long-to-short side ratio, scaling the long side of the target class sample image to be equal to the long side of the corresponding target sub-sample data set, and cropping along the short side of the target class sample image so that the short side of the target class sample image is equal to the short side of the corresponding target sub-sample data set;
and in response to the first long-to-short side ratio being equal to the second long-to-short side ratio, scaling the target class sample image to the preset size of the corresponding target sub-sample data set.
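The three branches of claim 15 collapse into one cover-then-crop routine. A Pillow-based sketch follows; the center placement of the crop is an assumption, since the claim does not say where along the cropped side to cut.

    from PIL import Image

    def scale_and_crop(image, preset_size):
        target_w, target_h = preset_size
        w, h = image.size
        # scale so the preset size is fully covered while keeping
        # the image's own long-to-short side ratio
        scale = max(target_w / w, target_h / h)
        new_w, new_h = round(w * scale), round(h * scale)
        image = image.resize((new_w, new_h), Image.LANCZOS)
        # crop the excess along whichever side overshoots;
        # when the two ratios are equal, the crop is a no-op
        left = (new_w - target_w) // 2
        top = (new_h - target_h) // 2
        return image.crop((left, top, left + target_w, top + target_h))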
16. The method of claim 13, wherein after determining the second aspect ratio corresponding to the target class sample image based on the size of the target class sample image, the method further comprises:
deleting the current target class sample image in response to the second aspect ratio corresponding to the current target class sample image falling outside a preset aspect ratio range.
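Claim 16 amounts to a one-line filter; the bounds below are illustrative assumptions, since the patent only requires some preset aspect ratio range.

    def drop_extreme_ratios(sample_images, min_ratio=0.25, max_ratio=4.0):
        # delete images whose width/height ratio falls outside the preset range
        return [img for img in sample_images
                if min_ratio <= img.size[0] / img.size[1] <= max_ratio]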
17. A training device for an image generation model, comprising:
a first acquisition module that acquires a training sample data set; wherein the training sample data set comprises a plurality of target class sample images;
an identification module that performs target local recognition on the target class sample image to obtain an image frame corresponding to the target local;
a determining module that determines a sample expansion number based on a first ratio of the area of the image frame to the area of the target class sample image;
a generation module that generates an expanded sample image based on the image frame and copies the expanded sample image based on the sample expansion number;
and a first training module that expands the copied expanded sample images into the training sample data set and trains a target image generation model based on the expanded training sample data set.
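A sketch of the determining and generation modules of claim 17. The inverse-ratio rule, the cap, and the reading of "generating an expanded sample image based on the image frame" as a crop are all assumptions; the claim only fixes that the sample expansion number depends on the first ratio of the image frame's area to the whole image's area.

    import math

    def sample_expansion_number(frame_area, image_area, max_copies=8):
        # the smaller the target local's share of the image,
        # the more expanded copies are produced
        first_ratio = frame_area / image_area
        return min(max_copies, max(1, math.ceil(1.0 / first_ratio) - 1))

    def expand_sample(image, frame_box):
        # frame_box is a (left, top, right, bottom) tuple in pixels
        crop = image.crop(frame_box)
        copies = sample_expansion_number(
            (frame_box[2] - frame_box[0]) * (frame_box[3] - frame_box[1]),
            image.size[0] * image.size[1])
        return [crop.copy() for _ in range(copies)]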
18. A training device for an image generation model, comprising:
a second acquisition module that acquires a training sample data set; wherein the training sample data set comprises a plurality of target class sample images;
a grouping module that divides the training sample data set after expansion into a plurality of sub-sample data sets based on the size of the target class sample image; wherein the sample images in each sub-sample data set are the same size and the sample images in different sub-sample data sets are different sizes;
and a second training module that trains the target image generation model in batches based on each sub-sample data set.
19. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 11 or 12 to 16 when executing the program.
20. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 11 or 12 to 16.
Application number: CN202311668162.8A
Title: Training method and device for image generation model, electronic equipment and storage medium
Filing date: 2023-12-06 (priority date: 2023-12-06)
Publication number: CN117611947A (published 2024-02-27)
Family ID: 89947935
Country status: CN, pending


Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination