CN115147526A - Method and device for training clothing generation model and method and device for generating clothing image


Info

Publication number
CN115147526A
CN115147526A
Authority
CN
China
Prior art keywords
image
shape
texture
encoder
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210767948.4A
Other languages
Chinese (zh)
Other versions
CN115147526B (en)
Inventor
杨少雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210767948.4A
Publication of CN115147526A
Application granted
Publication of CN115147526B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/80 2D [Two Dimensional] animation, e.g. using sprites
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/40 Analysis of texture
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a method and an apparatus for training a clothing generation model and a method and an apparatus for generating clothing images, which relate to the technical field of artificial intelligence, in particular to the technical fields of augmented reality (AR), virtual reality, computer vision, deep learning, etc., and can be applied to scenes such as the metaverse. The specific implementation scheme is as follows: acquiring a sample of a clothing image and training a shape encoder and a texture encoder; selecting a part of the shape features and a part of the texture features obtained by the shape encoder and the texture encoder respectively and splicing them to obtain a spliced feature; inputting the spliced feature into a pre-training model and outputting a virtual clothing image; adjusting relevant parameters of the shape encoder, the texture encoder and the pre-training model based on the difference between the original image and the virtual clothing image; and obtaining a clothing generation model based on the adjusted shape encoder, texture encoder and pre-training model. This embodiment can obtain a model capable of generating a clothing image of a specified style.

Description

Method and device for training clothing generation model and method and device for generating clothing image
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of augmented reality (AR), virtual reality, computer vision, deep learning and the like, can be applied to scenes such as the metaverse, and specifically relates to a method and an apparatus for training a clothing generation model and a method and an apparatus for generating clothing images.
Background
In recent years, with the rapid development of computer technology, image processing technology has been applied in many areas. One example is the personalization of cartoon avatar apparel: the clothing part of a 2D cartoon avatar needs to be generated from a photo of a real person, the generated clothing must conform to the shape of a given template, and it should remain highly similar to the clothing in the original photo.
In the related art, the shape and texture of the generated clothing image are not controllable, so high-similarity reconstruction of a clothing image with a specific shape and texture style cannot be achieved.
Disclosure of Invention
The present disclosure provides methods, apparatuses, devices, storage media, and computer program products for training a clothing generation model and for generating clothing images.
According to a first aspect of the present disclosure, there is provided a training method of a clothing generation model, including: obtaining a sample image of the clothes, wherein the sample image comprises an original image, a shape mask image and a texture image; training a first generation model based on the shape mask image, wherein the first generation model comprises a pre-trained model and a shape encoder; training a second generative model based on the texture image, wherein the second generative model comprises a pre-trained model and a texture encoder; respectively obtaining shape features and texture features of the sample image through the shape encoder and the texture encoder, and respectively selecting a part from the shape features and the texture features to be spliced to obtain splicing features; based on the splicing characteristics, obtaining a virtual clothing image through the pre-training model; adjusting relevant parameters of the shape encoder, the texture encoder, and the pre-trained model based on differences between the original image and the virtual apparel image; and obtaining a clothing generation model based on the adjusted shape encoder, the adjusted texture encoder and the pre-training model.
According to a second aspect of the present disclosure, there is provided a method of generating an image of a garment, comprising: acquiring a shape image and a texture image; inputting the shape image and the texture image into a clothing generation model generated by the method of the first aspect, and generating a clothing image of a specified style.
According to a third aspect of the present disclosure, there is provided a training apparatus for a clothing generation model, comprising: an acquisition unit configured to acquire a sample image of a garment, wherein the sample image includes an original image, a shape mask image, and a texture image; a first training unit configured to train a first generation model based on the shape mask image, wherein the first generation model comprises a pre-training model and a shape encoder; a second training unit configured to train a second generative model based on the texture image, wherein the second generative model comprises a pre-trained model and a texture encoder; the splicing unit is configured to enable the sample images to respectively pass through the shape encoder and the texture encoder to obtain shape features and texture features, and select a part from the shape features and the texture features to be spliced respectively to obtain splicing features; a prediction unit configured to obtain a virtual clothing image through the pre-training model based on the stitching features; a third training unit configured to adjust relevant parameters of the shape encoder, the texture encoder, and the pre-trained model based on a difference between the original image and the virtual apparel image; an output unit configured to obtain a clothing generation model based on the adjusted shape encoder, texture encoder and pre-training model.
According to a fourth aspect of the present disclosure, there is provided an apparatus for generating an image of a garment, comprising: an acquisition unit configured to acquire a shape image and a texture image; and a generating unit configured to input the shape image and the texture image into a clothing generation model generated by the apparatus according to the third aspect, and generate a clothing image of a specified style.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first or second aspect.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of the first or second aspect.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of the first or second aspect.
The application provides a 2D clothing image generation technique based on the fused control of shape and texture encodings. Given a clothing style shape and a texture image, it can reconstruct a high-quality, clean clothing image that has the given style shape and is highly similar to the input texture, enabling the editing and generation of 2D clothing images with controllable style shape and texture. The techniques presented herein may be used for high-quality reconstruction and generation of apparel parts for 2D avatars (short sleeves, long sleeves, trousers, shorts, skirts, etc.), as well as for mass 2D apparel design and creation. In addition, the technique provided herein can be used in a 2D virtual fitting solution to dress a model in any posture, and therefore has wide application scenarios and commercial value.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is an exemplary system architecture diagram to which one embodiment of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a training method for a clothing generation model according to the application;
FIG. 3 is a schematic diagram of an application scenario of a training method of a clothing generation model according to the application;
FIG. 4 is a flow diagram of one embodiment of a method of generating an image of apparel in accordance with the present application;
FIG. 5 is a schematic diagram of an embodiment of a training apparatus for a clothing generation model according to the present application;
FIG. 6 is a schematic diagram of the structure of one embodiment of an apparatus for generating an image of apparel in accordance with the present application;
FIG. 7 is a block diagram of an electronic device for implementing the method of training a clothing generation model and the method of generating clothing images according to an embodiment of the application.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 illustrates an exemplary system architecture 100 to which a training method of a garment generation model, a training apparatus of a garment generation model, a method of generating a garment image, or an apparatus of generating a garment image of embodiments of the application may be applied.
As shown in fig. 1, the system architecture 100 may include terminals 101, 102, a network 103, a database server 104, and a server 105. The network 103 serves as a medium for providing communication links between the terminals 101, 102, the database server 104 and the server 105. Network 103 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user 110 may use the terminals 101, 102 to interact with the server 105 over the network 103 to receive or send messages or the like. The terminals 101 and 102 may have various client applications installed thereon, such as a model training application, a clothing image editing application, a virtual fitting application, a shopping application, a payment application, a web browser, an instant messenger, and the like.
Here, the terminals 101 and 102 may be hardware or software. When the terminals 101 and 102 are hardware, they may be various electronic devices with display screens, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), laptop portable computers, desktop computers, and the like. When the terminals 101 and 102 are software, they can be installed in the electronic devices listed above. It may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. And is not particularly limited herein.
When the terminals 101, 102 are hardware, an image capturing device may be further mounted thereon. The image acquisition device can be various devices capable of realizing the function of acquiring images, such as a camera, a sensor and the like. The user 110 may capture some apparel images using an image capture device on the terminal 101, 102.
Database server 104 may be a database server that provides various services. For example, a database server may have a sample set stored therein. The sample set contains a large number of samples. The samples may include, among other things, an original image, a shape mask image, and a texture image. In this way, the user 110 may also select samples from a set of samples stored by the database server 104 via the terminals 101, 102.
The server 105 may also be a server providing various services, such as a background server providing support for various applications displayed on the terminals 101, 102. The background server may train the initial model using the samples in the sample set sent by the terminals 101 and 102, and may send the training result (e.g., the generated clothing generation model) to the terminals 101 and 102. In this way, the user can apply the generated clothing generation model to carry out clothing design, and clothing images with specified shapes and textures are generated.
Here, the database server 104 and the server 105 may be hardware or software. When they are hardware, they can be implemented as a distributed server cluster composed of a plurality of servers, or as a single server. When they are software, they may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the training method of the clothing generation model or the method of generating the clothing image provided in the embodiment of the present application is generally executed by the server 105. Accordingly, the training apparatus for the clothing generation model or the apparatus for generating the clothing image is also typically located in the server 105.
It is noted that database server 104 may not be provided in system architecture 100, as server 105 may perform the relevant functions of database server 104.
It should be understood that the number of terminals, networks, database servers, and servers in fig. 1 are merely illustrative. There may be any number of terminals, networks, database servers, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a training method for a clothing generation model in accordance with the present application is shown. The training method of the clothing generation model comprises the following steps:
step 201, a sample image of the clothing is obtained.
In this embodiment, an execution body (e.g., the server shown in fig. 1) of the training method of the clothing generation model may obtain the sample image set in various ways. For example, the execution body may obtain an existing sample image set from a database server (e.g., database server 104 shown in fig. 1) via a wired connection or a wireless connection. As another example, a user may collect sample images via a terminal (e.g., terminals 101, 102 shown in FIG. 1). In this way, the execution body may receive the sample images collected by the terminal and store them locally, thereby generating a sample image set.
Here, the sample image set may include at least one sample image. Wherein the sample image includes an original image, a shape mask image, and a texture image. Sample images are selected from the sample image set, and steps 202-207 are performed, wherein the selection manner and the number of the sample images are not limited in the present application. For example, at least one sample image may be selected randomly, or a sample image with better sharpness (i.e., higher pixels) may be selected from the sample images.
The original image is a color image that includes the apparel. The shape mask image is a black-and-white image of the outline of the garment, also referred to as the shape mask. The clothing shape mask can be extracted from the original image through algorithms such as image semantic segmentation. The texture image is a color image including the texture and color of the apparel, such as a texture image of blue flowers on a white background. The texture image can be a clothing fragment cut out of the original image at will, and it may have any shape. The texture image can also be an image obtained by randomly occluding the original image, which simulates clothing texture occluded by the user's arms and yields a defective texture image.
Step 202, training the first generation model based on the shape mask image.
In this embodiment, the first generation model includes a pre-training model and a shape encoder. The shape encoder is a convolutional neural network for extracting shape features of an image. Each layer of the shape encoder outputs shape features. The pre-training model may be the generator of a GAN. The training process of the pre-training model is as follows: acquire a ground-truth clothing image (clothing GT) as a truth label; randomly generate some vectors, input them into the generator, and output a predicted image; let the discriminator judge whether the predicted image and the ground-truth clothing image are real or fake; alternately adjust the parameters of the generator and the discriminator; and finally take the resulting generator as the pre-training model.
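For illustration only, the sketch below shows a minimal adversarial pre-training loop of the kind described above, assuming PyTorch; the module names (`generator`, `discriminator`), the data loader, and all hyperparameters are placeholders and are not taken from the patent.

```python
# Minimal sketch of the first-stage adversarial pre-training (illustrative only).
import torch
import torch.nn.functional as F

def infinite(loader):
    # Yield batches from the dataloader indefinitely.
    while True:
        for batch in loader:
            yield batch

def pretrain_generator(generator, discriminator, real_loader,
                       latent_dim=512, steps=100_000, device="cuda"):
    g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
    batches = infinite(real_loader)
    for _ in range(steps):
        real = next(batches).to(device)                            # ground-truth clothing images (clothing GT)
        z = torch.randn(real.size(0), latent_dim, device=device)   # randomly generated vectors
        fake = generator(z)                                        # predicted clothing images

        # Discriminator step: judge real images as real and generated images as fake.
        d_real = discriminator(real)
        d_fake = discriminator(fake.detach())
        d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
                  + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
        d_opt.zero_grad(); d_loss.backward(); d_opt.step()

        # Generator step: adjust the generator so the discriminator judges its output as real.
        g_score = discriminator(fake)
        g_loss = F.binary_cross_entropy_with_logits(g_score, torch.ones_like(g_score))
        g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return generator  # later used (frozen) as the pre-training model
```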
The training process for the shape encoder is as follows: the shape features of the shape mask image in the sample are extracted by the shape encoder in the first generation model; the shape features are then input into the pre-training model in the first generation model, and a predicted clothing image is output. Image semantic segmentation is performed on the predicted clothing image to obtain a segmented image. Then, the similarity between the segmented image and the shape mask image is calculated, and if the similarity is smaller than a preset value, the relevant parameters of the shape encoder are adjusted. The similarity calculation may adopt common methods in the prior art, including but not limited to cosine distance, Euclidean distance, and the like. The relevant parameters of the pre-training model can be fixed while the shape encoder is trained; alternatively, the relevant parameters of the pre-training model and the shape encoder can be adjusted at the same time.
After the relevant parameters of the shape encoder are adjusted, samples are selected again and the training process is executed again, until the similarity is larger than the preset value.
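A hedged sketch of this shape-encoder training stage is shown below, assuming PyTorch; `shape_encoder`, `generator` (the frozen pre-training model), and `segment` are placeholder names, and an L1 distance between masks stands in for the similarity check described above.

```python
# Sketch of second-stage shape-encoder training (placeholder names; the frozen
# generator and a segmentation routine are assumed to exist already).
import torch
import torch.nn.functional as F

def train_shape_encoder(shape_encoder, generator, segment, mask_loader,
                        epochs=10, device="cuda"):
    generator.requires_grad_(False)                    # the pre-training model is fixed here
    opt = torch.optim.Adam(shape_encoder.parameters(), lr=1e-4)
    for _ in range(epochs):
        for shape_mask in mask_loader:
            shape_mask = shape_mask.to(device)
            shape_feats = shape_encoder(shape_mask)    # multi-layer shape features
            pred = generator(shape_feats)              # predicted clothing image
            # `segment` is assumed to be a differentiable segmentation network,
            # so the loss can backpropagate to the shape encoder.
            pred_mask = segment(pred)                  # image semantic segmentation of the prediction
            # Shape loss: L1 distance between the segmented prediction and the input shape mask.
            loss = F.l1_loss(pred_mask, shape_mask)
            opt.zero_grad(); loss.backward(); opt.step()
    return shape_encoder
```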
Step 203, training the second generative model based on the texture image.
In this embodiment, the second generative model comprises a pre-trained model and a texture encoder. The texture encoder is a convolutional neural network for extracting texture features of an image. Each layer of the texture encoder outputs texture features. The number of layers of the texture encoder and the shape encoder may be the same or different.
The training process of the texture encoder is as follows: the texture features of the texture image in the sample are extracted by the texture encoder in the second generative model; they are then input into the pre-training model in the second generative model, and a predicted clothing image is output. The similarity between the predicted clothing image and the texture image is then calculated, and if the similarity is smaller than a preset value, the relevant parameters of the texture encoder are adjusted. The similarity calculation may adopt common methods in the prior art, including but not limited to cosine distance, Euclidean distance, and the like, and the similarity of the two images can be measured in terms of both pixel loss and perceptual loss. The relevant parameters of the pre-training model can be fixed while the texture encoder is trained; alternatively, the relevant parameters of the pre-training model and the texture encoder can be adjusted at the same time.
After the relevant parameters of the texture encoder are adjusted, samples are selected again and the training process is executed again, until the similarity is larger than the preset value.
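For illustration, a corresponding sketch of the texture-encoder stage is given below, assuming PyTorch and the third-party `lpips` package for the perceptual term; all names are placeholders, not the patent's implementation.

```python
# Sketch of second-stage texture-encoder training.
import torch
import torch.nn.functional as F
import lpips  # pip install lpips

def train_texture_encoder(texture_encoder, generator, texture_loader,
                          epochs=10, device="cuda"):
    generator.requires_grad_(False)                     # the pre-training model is fixed here
    percep = lpips.LPIPS(net="vgg").to(device)          # LPIPS perceptual distance
    opt = torch.optim.Adam(texture_encoder.parameters(), lr=1e-4)
    for _ in range(epochs):
        for texture in texture_loader:
            texture = texture.to(device)
            tex_feats = texture_encoder(texture)        # multi-layer texture features
            pred = generator(tex_feats)                 # predicted clothing image
            pixel_loss = F.mse_loss(pred, texture)      # 2D (L2) distance loss
            percep_loss = percep(pred, texture).mean()  # perceptual loss
            loss = pixel_loss + percep_loss
            opt.zero_grad(); loss.backward(); opt.step()
    return texture_encoder
```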
And 204, respectively obtaining shape characteristics and texture characteristics of the sample image through a shape encoder and a texture encoder, and respectively selecting a part from the shape characteristics and the texture characteristics to splice to obtain splicing characteristics.
In this embodiment, multi-layer shape features and multi-layer texture features are obtained during the training of the shape encoder and the texture encoder. Shape features of a first number of layers may be selected from the multi-layer shape features, and texture features of a second number of layers may be selected from the multi-layer texture features, where the sum of the first number and the second number equals the number of input layers of the pre-training model. For example, 8 layers of shape features and 10 layers of texture features are selected and spliced into an 18-layer spliced feature; alternatively, 9 layers of each may be selected.
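A minimal sketch of this splicing step is shown below, assuming each encoder returns its per-layer features as a tensor of shape (batch, num_layers, dim), e.g. (B, 18, 512); the function name and the 8/10 split follow the example above and are illustrative.

```python
# Sketch of the feature stitching (splicing) step.
import torch

def stitch_features(shape_feats, texture_feats, num_shape_layers=8):
    low = shape_feats[:, :num_shape_layers]        # e.g. layers 1-8 from the shape encoder
    high = texture_feats[:, num_shape_layers:]     # e.g. layers 9-18 from the texture encoder
    return torch.cat([low, high], dim=1)           # 18-layer stitched (spliced) feature
```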
And step 205, obtaining a virtual clothing image through a pre-training model based on the splicing characteristics.
In this embodiment, the pre-training model is the generator of a GAN, and the virtual clothing image can be predicted from the spliced features.
At step 206, relevant parameters of the shape encoder, texture encoder, and pre-trained model are adjusted based on the difference between the original image and the virtual apparel image.
In this embodiment, the difference between the original image of the sample and the virtual clothing image may be calculated in various ways, and the similarity may be calculated by a common method of calculating image similarity. If the similarity is smaller than the preset similarity threshold, the relevant parameters of the three models, namely the shape encoder, the texture encoder and the pre-training model, are adjusted so that the difference between the original image and the virtual clothing image is reduced, until the similarity converges above the preset threshold.
Step 207, obtaining a clothing generation model based on the adjusted shape encoder, texture encoder and pre-training model.
In this embodiment, if the difference between the original image of the sample and the virtual clothing image is smaller than a predetermined value, or the similarity is greater than the predetermined similarity threshold, the training of the shape encoder, the texture encoder and the pre-training model is completed. The three together make up the clothing generation model. The clothing generation model may be published to a server or terminal device.
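As an illustration of this joint stage, the sketch below performs one training step in which all three models are updated; `percep` is assumed to be an LPIPS module, `optimizer` is assumed to cover the parameters of all three models, and the difference term here uses a pixel-plus-perceptual form (a shape term can be added as described later).

```python
# Sketch of one third-stage joint training step (placeholder names).
import torch
import torch.nn.functional as F

def joint_step(shape_encoder, texture_encoder, generator, percep,
               original, shape_mask, texture, optimizer, num_shape_layers=8):
    shape_feats = shape_encoder(shape_mask)
    tex_feats = texture_encoder(texture)
    stitched = torch.cat([shape_feats[:, :num_shape_layers],
                          tex_feats[:, num_shape_layers:]], dim=1)
    virtual = generator(stitched)                                  # virtual clothing image
    # Difference between the original image and the virtual clothing image
    # (pixel-level L2 plus LPIPS perceptual distance).
    loss = F.mse_loss(virtual, original) + percep(virtual, original).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```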
The embodiment provided by the application solves the technical problem of generating a 2D high-precision clothing image with a specified shape and texture. The technical scheme provided by the application can be divided into three stages: the first stage trains a clothing generator model that can generate clothing images with arbitrary, unspecified shapes and textures; the second stage trains a shape encoder model and a texture encoder model separately; and the third stage fuses the shape encoder encoding with the texture encoder encoding and feeds them to the clothing generator model to generate 2D clothing images that conform to the given shape and texture.
The technical scheme provided herein solves the problem of generating a clothing image under shape and texture constraints, can generate a 2D high-quality cartoon avatar clothing part from a single input photo, and enables the mass-production creation of 2D clothing digital assets. Moreover, the technique provided herein can be used in a 2D virtual fitting solution and has wide application scenarios.
In some optional implementations of this embodiment, training a shape encoder based on the shape mask image and a pre-training model includes: inputting the shape mask image into the shape encoder to obtain shape features; inputting the shape features into the pre-training model to generate a predicted clothing image; performing image semantic segmentation on the predicted clothing image to obtain a segmentation mask map; and adjusting relevant parameters of the shape encoder based on the shape loss between the segmentation mask map and the shape mask image, wherein the relevant parameters of the pre-training model are fixed during training of the shape encoder. The shape loss here is the L1 distance loss. Image semantic segmentation is an important part of image processing and machine vision technology with respect to image understanding, and is also an important branch of the AI field. Semantic segmentation classifies each pixel in the image and determines the category of each point (such as background, person, or vehicle), thereby dividing the image into regions. The virtual clothing image can be semantically segmented by a common semantic segmentation model in the prior art, and the resulting segmentation mask map shows the clothing outline but not its color or texture. In this way, the accuracy of the shape and texture of the clothing generated by the trained model can be improved.
In some optional implementations of this embodiment, the method further includes: inputting the predicted clothing image into a discriminator of the pre-training model to obtain a discrimination result; and adjusting relevant parameters of the shape encoder based on the discrimination result. The pre-training model is the generator of a GAN, which also has a corresponding discriminator. The discriminator is used to judge whether the input clothing image is a generated fake image. The discrimination result is the probability that the image is real. The discrimination loss is calculated from the discrimination result and the label of the actually input image (1 for a real image, 0 for a fake image). The calculation of the discrimination loss uses the prior art and is therefore not described again. The relevant parameters of the shape encoder are adjusted through the discrimination loss, so that the discriminator cannot distinguish the authenticity of the picture.
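A small hedged sketch of this additional term is given below; the discriminator and the `shape_loss` from the earlier sketch are assumed, and the non-saturating binary cross-entropy form is one common choice, not necessarily the patent's exact formulation.

```python
# Sketch of the GAN discrimination term added to the shape-encoder loss.
import torch
import torch.nn.functional as F

def adversarial_term(discriminator, predicted_image):
    # Push the discriminator's score for the predicted clothing image toward
    # "real" (1), so it cannot tell the reconstruction from a genuine image.
    score = discriminator(predicted_image)
    return F.binary_cross_entropy_with_logits(score, torch.ones_like(score))

# total shape-encoder loss (weights are illustrative):
# loss = shape_loss + adversarial_term(discriminator, pred)
```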
In some optional implementations of this embodiment, training a texture encoder based on the texture image and a pre-training model includes: inputting the texture image into the texture encoder to obtain texture features; inputting the texture features into the pre-training model to generate a predicted clothing image; and adjusting relevant parameters of the texture encoder based on the 2D distance loss and the perceptual loss between the predicted clothing image and the texture image, wherein the relevant parameters of the pre-training model are fixed in the process of training the texture encoder. The 2D distance loss is the L2 loss. Learned Perceptual Image Patch Similarity (LPIPS), also known as "perceptual loss", is used to measure the difference between two images. Introducing the perceptual loss improves the accuracy of the calculated image difference, so that a more accurate clothing generation model is generated.
In some optional implementations of this embodiment, obtaining the spliced feature by stitching selected parts of the shape features and the texture features obtained for the sample by the shape encoder and the texture encoder respectively includes: selecting low-layer features amounting to half of a predetermined number of layers from the features output by the shape encoder; selecting high-layer features amounting to half of the predetermined number of layers from the features output by the texture encoder; and splicing the low-layer features and the high-layer features into a spliced feature with the predetermined number of layers. The number of input layers of the pre-training model is the predetermined number of layers, and half of that number of feature layers can be selected from each of the shape features and the texture features. Assuming a total of 18 layers, the shape features of layers 1-8 and the texture features of layers 9-18 are selected for stitching; the features of the other layers are discarded. This is because shape can be well expressed by lower-layer features, while texture is better expressed by higher-layer features. Therefore, the combination of low-layer shape features and high-layer texture features can represent the features of the sample more accurately, so that an accurate clothing generation model is trained and the similarity between the shape and texture of the generated clothing image and the specified shape and texture is improved.
In some optional implementations of this embodiment, adjusting the relevant parameters of the shape encoder, the texture encoder and the pre-training model based on the difference between the original image of the sample and the virtual clothing image includes: calculating a 2D distance loss between the original image of the sample and the virtual apparel image; calculating a perceptual loss between the original image of the sample and the virtual apparel image; taking a weighted sum of the 2D distance loss and the perceptual loss as a first loss value; and if the first loss value is larger than or equal to a first preset threshold value, adjusting relevant parameters of the shape encoder, the texture encoder and the pre-training model. The 2D distance loss is the L2 loss. Learned Perceptual Image Patch Similarity (LPIPS), also known as "perceptual loss", is used to measure the difference between two images. Introducing the perceptual loss improves the accuracy of the calculated image difference, so that a more accurate clothing generation model is generated.
In some optional implementations of this embodiment, the method further includes: performing image semantic segmentation on the virtual clothing image to obtain a segmentation mask map; calculating an L1 distance loss between the segmentation mask map and the shape mask image of the sample; taking a weighted sum of the L1 distance loss, the 2D distance loss, and the perceptual loss as a second loss value; and if the second loss value is larger than or equal to a second preset threshold value, adjusting relevant parameters of the shape encoder, the texture encoder and the pre-training model. Computing the total loss from these three losses guarantees both shape similarity and texture similarity, so that a more accurate clothing generation model is generated.
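As a hedged sketch of this combined "second loss value", the function below weights the three terms; `percep` is assumed to be an LPIPS module, `virtual_mask` is the semantic segmentation of the virtual image, and the weights are illustrative defaults rather than values given in the patent.

```python
# Sketch of the combined third-stage loss (the "second loss value").
import torch
import torch.nn.functional as F

def second_loss_value(virtual, original, virtual_mask, shape_mask, percep,
                      w_shape=1.0, w_pixel=1.0, w_percep=1.0):
    l1_shape = F.l1_loss(virtual_mask, shape_mask)    # L1 distance loss on segmentation masks
    l2_pixel = F.mse_loss(virtual, original)          # 2D (pixel-level) distance loss
    lp = percep(virtual, original).mean()             # LPIPS perceptual loss
    return w_shape * l1_shape + w_pixel * l2_pixel + w_percep * lp
```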
In some optional implementations of this embodiment, obtaining a sample of a clothing image includes: acquiring an original image of the clothing; performing image semantic segmentation on the original image to obtain a shape mask image; randomly occluding the original image to obtain a texture image; and combining the original image with the corresponding shape mask image and texture image into a sample. The texture image obtained in this way is a defective texture image, so the resulting clothing generation model can repair the defective texture and obtain a complete clothing image. For example, if some of the buttons in the texture image of a training sample are covered by an arm (say, 2 of 5 buttons in total), the clothing generation model can still generate a clothing image showing all the buttons.
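A minimal sketch of this sample construction is shown below for NumPy-style HWC images; `segment` is an off-the-shelf semantic-segmentation routine returning a binary clothing mask, and the rectangular cut-out is just one simple way to simulate an occluding arm, not the patent's exact occlusion scheme.

```python
# Sketch of building an (original, shape mask, texture) training sample.
import numpy as np

def build_sample(original, segment, occlusion_ratio=0.3, rng=None):
    rng = rng or np.random.default_rng()
    shape_mask = segment(original)                    # black-and-white clothing outline
    h, w = original.shape[:2]
    oh, ow = int(h * occlusion_ratio), int(w * occlusion_ratio)
    top = int(rng.integers(0, h - oh))
    left = int(rng.integers(0, w - ow))
    texture = original.copy()
    texture[top:top + oh, left:left + ow] = 0         # randomly occluded, defective texture image
    return original, shape_mask, texture              # one training sample
```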
In some optional implementations of this embodiment, the pre-training model is the generator of a generative adversarial network, and the shape encoder and the texture encoder are convolutional neural networks with the same number of layers. The generator in StyleGAN can be employed as the pre-training model. This network structure is easy to build and train and performs well.
With continuing reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the method for training a clothing generation model according to the present embodiment. In the application scenario of fig. 3, the clothing generation model includes a shape encoder, a texture encoder, and a generator. The generator is trained on a clothing data set based on the StyleGAN network to obtain a pre-training model. The specific training process is as follows:
1. First, collect a large amount of 2D clothing image data and perform scale alignment processing;
2. Then, perform clothing image segmentation on the aligned 2D clothing images and extract the clothing shape masks to obtain shape mask images;
3. First-stage training: train a StyleGAN clothing image generator (512 × 512 resolution) on a large number of aligned clothing pictures to obtain the pre-training model;
4. Second-stage training: a shape encoder model is trained independently by adding a shape encoder to the StyleGAN image generator; a texture encoder model is trained separately by adding a texture encoder to the StyleGAN clothing image generator (the StyleGAN clothing generator is frozen during the training of both encoders, i.e. it does not participate in their training and serves only as the clothing generator while the shape encoder and the texture encoder are trained respectively);
5. Third-stage training: add the shape encoder and the texture encoder, fuse the outputs of the two encoders (the fusion ratio is 7…), and feed the fused encoding to the StyleGAN clothing generator to train the models jointly;
6. 1) The first-stage training loss function is the standard GAN loss function corresponding to StyleGAN;
2) In the second-stage training, the shape encoder loss function comprises two terms: the GAN discrimination loss, and the shape loss (the L1 distance loss between the input mask and the segmentation mask of the generated clothing image); the texture encoder loss function comprises two terms: a pixel-level loss (the 2D distance loss between the clothing GT and the generated clothing image) and an LPIPS clothing perception loss (the LPIPS perceptual distance loss between the clothing GT and the generated clothing image).
3) In the third-stage training, the loss functions comprise three terms: the shape loss (the L1 distance loss between the input mask and the segmentation mask of the generated clothing image), the pixel-level loss (the 2D distance loss between the clothing GT and the generated clothing image), and the LPIPS clothing perception loss (the LPIPS perceptual distance loss between the clothing GT and the generated clothing image).
7. After the third-stage training is finished, a target shape mask and a partially incomplete clothing texture are input at prediction time, and a 2D clothing image that conforms to the input shape and is similar in texture to the input texture is obtained.
With continued reference to FIG. 4, a flow 400 of yet another embodiment of a method of generating an image of apparel in accordance with the present application is shown. The method for generating the clothing image can comprise the following steps:
step 401, obtaining a shape image and a texture image of a garment of a specified style.
In the present embodiment, the execution subject (e.g., the server 105 shown in fig. 1) of the method of generating a clothing image may acquire the shape image and the texture image of the clothing of the specified style in various ways. For example, the execution subject may obtain the shape image and the texture image of the apparel of the specified style from a database server (e.g., database server 104 shown in fig. 1) through a wired or wireless connection. As another example, the execution subject may receive a shape image and a texture image of clothing of a specified style collected by a terminal (e.g., the terminals 101, 102 shown in fig. 1) or another device. For example, the shape of a long-sleeved coat and the texture image of a short-sleeved T-shirt with a yellow little-star pattern may be specified.
Step 402, inputting the shape image and the texture image into a clothing generation model to generate a clothing image with a specified style.
In this embodiment, the execution subject may input the images acquired in step 401 into the clothing generation model, thereby generating a clothing image of the specified style, for example, a long-sleeved windcheater with a yellow little-star pattern.
In this embodiment, the apparel generation model may be generated using the method described above in the embodiment of fig. 2. For a specific generation process, reference may be made to the related description of the embodiment in fig. 2, which is not described herein again.
It should be noted that the method for generating a clothing image according to this embodiment may be used to test the clothing generation model generated according to the foregoing embodiments. And then the clothing generation model can be continuously optimized according to the generated clothing image. The method may also be a practical application method of the clothing generation model generated by the above embodiments. By generating a clothing image using the clothing generation model generated in each of the above embodiments, a clothing image of a predetermined shape and texture can be generated.
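For illustration, the sketch below shows how the trained clothing generation model might be applied at inference time; module and argument names are assumptions, and the 8/10 layer split mirrors the stitching example used during training.

```python
# Sketch of inference with a trained clothing generation model.
import torch

@torch.no_grad()
def generate_apparel(shape_encoder, texture_encoder, generator,
                     shape_image, texture_image, num_shape_layers=8):
    shape_feats = shape_encoder(shape_image)       # e.g. the shape of a long-sleeved coat
    tex_feats = texture_encoder(texture_image)     # e.g. a yellow little-star texture
    stitched = torch.cat([shape_feats[:, :num_shape_layers],
                          tex_feats[:, num_shape_layers:]], dim=1)
    return generator(stitched)                     # clothing image of the specified style
```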
With continued reference to FIG. 5, as an implementation of the method shown in FIG. 2 above, the present application provides an embodiment of a training apparatus for a clothing generation model. The embodiment of the device corresponds to the embodiment of the method shown in fig. 2, and the device can be applied to various electronic devices.
As shown in fig. 5, the training apparatus 500 of the clothing generation model of the present embodiment may include: the device comprises an acquisition unit 501, a first training unit 502, a second training unit 503, a splicing unit 504, a prediction unit 505, a third training unit 506 and an output unit 507. The acquiring unit 501 is configured to acquire a sample image of a garment, where the sample image includes an original image, a shape mask image, and a texture image; a first training unit 502 configured to train a first generation model based on the shape mask image, wherein the first generation model comprises a pre-training model and a shape encoder; a second training unit 503 configured to train a second generative model based on the texture image, wherein the second generative model comprises a pre-trained model and a texture encoder; a stitching unit 504, configured to obtain shape features and texture features of the sample image through the shape encoder and the texture encoder, respectively, and select a part of the shape features and the texture features for stitching, respectively, to obtain a stitching feature; a prediction unit 505 configured to obtain a virtual clothing image through the pre-training model based on the stitching features; a third training unit 506 configured to adjust relevant parameters of the shape encoder, the texture encoder, and the pre-trained model based on a difference between the original image and the virtual apparel image; an output unit 507 configured to obtain a clothing generation model based on the adjusted shape encoder, texture encoder and pre-training model.
In some optional implementations of this embodiment, the first training unit 502 is further configured to: inputting the shape mask image into a shape encoder in the first generation model to obtain shape features; inputting the shape features into a pre-training model in the first generation model to generate a predicted clothing image; performing image semantic segmentation on the predicted clothing image to obtain a segmentation mask image; adjusting relevant parameters of the shape encoder based on the segmentation mask map and the shape loss of the shape mask image, wherein the relevant parameters of the pre-trained model are fixed during training of the shape encoder.
In some optional implementations of the present embodiment, the first training unit 502 is further configured to: inputting the predicted clothing image into a discriminator of the pre-training model to obtain a discrimination result; and adjusting relevant parameters of the shape encoder based on the discrimination result.
In some optional implementations of this embodiment, the second training unit 503 is further configured to: inputting the texture image into a texture encoder in the second generation model to obtain texture features; inputting the texture features into a pre-training model in the second generation model to generate a predicted clothing image; and adjusting relevant parameters of a texture encoder based on the 2D distance loss and the perception loss of the predicted clothing image and the texture image, wherein the relevant parameters of a pre-training model are fixed in the process of training the texture encoder.
In some optional implementations of this embodiment, the splicing unit 504 is further configured to: selecting half of the low-layer features with the preset layer number from the shape features output by the shape encoder; selecting one half of high-level features with preset layers from texture features output by the texture encoder; and splicing the low-layer features and the high-layer features into splicing features with a preset number of layers.
In some optional implementations of this embodiment, the third training unit 506 is further configured to: calculating a 2D distance loss between an original image and the virtual apparel image; calculating a perceptual loss between an original image and the virtual apparel image; taking a weighted sum of the 2D distance loss and the perceptual loss as a first loss value; and if the first loss value is larger than or equal to a first preset threshold value, adjusting relevant parameters of the shape encoder, the texture encoder and the pre-training model.
In some optional implementations of this embodiment, the third training unit 506 is further configured to: performing image semantic segmentation on the virtual clothing image to obtain a segmentation mask map; calculating an L1 distance loss between the segmentation mask map and the shape mask image; taking a weighted sum of the L1 distance loss, the 2D distance loss, and the perceptual loss as a second loss value; and if the second loss value is greater than or equal to a second preset threshold value, adjusting relevant parameters of the shape encoder, the texture encoder and the pre-training model.
With continued reference to FIG. 6, the present application provides one embodiment of an apparatus for generating an image of apparel as an implementation of the method illustrated in FIG. 4 and described above. The embodiment of the device corresponds to the embodiment of the method shown in fig. 4, and the device can be applied to various electronic devices.
As shown in fig. 6, the apparatus 600 for generating a clothing image of the present embodiment may include: an acquisition unit 601 and a generation unit 602. Wherein the obtaining unit 601 is configured to obtain a shape image and a texture image; the generating unit 602 is configured to input the shape image and the texture image into the clothing generation model generated by the apparatus 500, and generate a clothing image of a specified style.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related users all comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of flows 200 or 400.
A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of flow 200 or 400.
A computer program product comprising a computer program which, when executed by a processor, implements the method of flow 200 or 400.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
Computing unit 701 may be a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 performs the respective methods and processes described above, such as the training method of the clothing generation model. For example, in some embodiments, the training method of the apparel generation model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 708. In some embodiments, part or all of the computer program may be loaded onto and/or installed onto device 700 via ROM 702 and/or communications unit 709. When the computer program is loaded into RAM 703 and executed by the computing unit 701, one or more steps of the training method of the apparel generation model described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured by any other suitable means (e.g., by means of firmware) to perform the training method of the apparel generation model.
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above, reordering, adding or deleting steps, may be used. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (19)

1. A training method for a clothing generative model, comprising:
obtaining a sample image of the clothes, wherein the sample image comprises an original image, a shape mask image and a texture image;
training a first generation model based on the shape mask image, wherein the first generation model comprises a pre-trained model and a shape encoder;
training a second generative model based on the texture image, wherein the second generative model comprises a pre-trained model and a texture encoder;
obtaining shape features and texture features of the sample image through the shape encoder and the texture encoder respectively, and selecting a part of each of the shape features and the texture features to stitch into stitched features;
obtaining a virtual clothing image through the pre-training model based on the stitched features;
adjusting relevant parameters of the shape encoder, the texture encoder, and the pre-trained model based on differences between the original image and the virtual apparel image;
and obtaining a clothing generation model based on the adjusted shape encoder, the adjusted texture encoder and the pre-training model.
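For illustration only, and not part of the claimed subject matter: a minimal PyTorch-style sketch of the joint training step of claim 1. The names shape_encoder, texture_encoder, pretrained_generator and stitch_fn, and the plain L1 loss used here as the image difference, are assumptions introduced for clarity; the claimed objective is detailed further in claims 6 and 7.

```python
import torch.nn.functional as F

def joint_training_step(shape_encoder, texture_encoder, pretrained_generator,
                        stitch_fn, optimizer, original_image, shape_mask, texture_image):
    shape_feats = shape_encoder(shape_mask)           # shape features of the sample
    texture_feats = texture_encoder(texture_image)    # texture features of the sample

    stitched = stitch_fn(shape_feats, texture_feats)  # stitched features (see claim 5)
    virtual_image = pretrained_generator(stitched)    # virtual clothing image

    # Difference between the original image and the virtual clothing image;
    # claims 6-7 use a weighted 2D distance + perceptual (+ mask) loss instead of plain L1.
    loss = F.l1_loss(virtual_image, original_image)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                  # adjusts all three modules jointly
    return loss.item()
```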
2. The method of claim 1, wherein training a first generation model based on the shape mask image comprises:
inputting the shape mask image into a shape encoder in the first generative model to obtain shape features;
inputting the shape features into a pre-training model in the first generation model to generate a predicted clothing image;
performing image semantic segmentation on the predicted clothing image to obtain a segmentation mask map;
adjusting relevant parameters of the shape encoder based on a shape loss between the segmentation mask map and the shape mask image, wherein the relevant parameters of the pre-trained model are fixed during training of the shape encoder.
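An illustrative sketch of claim 2, assuming PyTorch: the pre-trained generator stays frozen and only the shape encoder is updated from a loss between the segmentation mask map of the predicted image and the input shape mask. Here segment_fn stands in for a differentiable semantic-segmentation network, and L1 is an assumed form of the shape loss; neither is fixed by the claim.

```python
import torch.nn.functional as F

def shape_encoder_step(shape_encoder, pretrained_generator, segment_fn,
                       shape_optimizer, shape_mask):
    for p in pretrained_generator.parameters():
        p.requires_grad_(False)                       # pre-trained model parameters stay fixed

    shape_feats = shape_encoder(shape_mask)
    predicted_image = pretrained_generator(shape_feats)
    segmentation_mask = segment_fn(predicted_image)   # image semantic segmentation

    shape_loss = F.l1_loss(segmentation_mask, shape_mask)
    shape_optimizer.zero_grad()
    shape_loss.backward()
    shape_optimizer.step()                            # only shape-encoder parameters move
    return shape_loss.item()
```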
3. The method of claim 2, further comprising:
inputting the predicted clothing image into a discriminator of the pre-training model to obtain a discrimination result;
and adjusting relevant parameters of the shape encoder based on the discrimination result.
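For claim 3, a hedged sketch of how the discrimination result can feed back into the shape encoder. The non-saturating GAN form below is an assumption; the claim only states that the discriminator of the pre-trained model scores the predicted image and that the result is used to adjust the shape encoder.

```python
import torch.nn.functional as F

def adversarial_shape_loss(discriminator, predicted_image):
    score = discriminator(predicted_image)  # discrimination result (real/fake logit)
    # Push the shape encoder towards images the discriminator scores as real.
    return F.softplus(-score).mean()
```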
4. The method of claim 1, wherein training a second generative model based on the texture image comprises:
inputting the texture image into a texture encoder in the second generation model to obtain texture features;
inputting the texture features into a pre-training model in the second generation model to generate a predicted clothing image;
and adjusting relevant parameters of the texture encoder based on a two-dimensional (2D) distance loss and a perceptual loss between the predicted clothing image and the texture image, wherein the relevant parameters of the pre-trained model are fixed during training of the texture encoder.
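An illustrative sketch of the texture-encoder objective in claim 4, assuming PyTorch. The helper vgg_features stands in for any perceptual feature extractor, and the L1 form of the 2D distance loss and the weights w_dist / w_perc are assumptions.

```python
import torch.nn.functional as F

def texture_encoder_loss(predicted_image, texture_image, vgg_features,
                         w_dist=1.0, w_perc=0.8):
    dist_loss = F.l1_loss(predicted_image, texture_image)   # 2D distance loss
    perc_loss = F.l1_loss(vgg_features(predicted_image),
                          vgg_features(texture_image))      # perceptual loss
    # As in claim 2, the pre-trained model stays frozen; only the texture encoder
    # receives gradients from this loss.
    return w_dist * dist_loss + w_perc * perc_loss
```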
5. The method of claim 1, wherein the selecting a part of each of the shape features and the texture features to stitch into stitched features comprises:
selecting, from the shape features output by the shape encoder, low-level features accounting for one half of a preset number of layers;
selecting, from the texture features output by the texture encoder, high-level features accounting for the other half of the preset number of layers;
and stitching the low-level features and the high-level features into stitched features having the preset number of layers.
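An illustrative sketch of the stitching rule in claim 5, assuming each encoder emits one latent code per generator layer (a StyleGAN-like per-layer layout, which is an assumption): the lower half of the preset number of layers is taken from the shape encoder and the upper half from the texture encoder.

```python
import torch

def stitch_features(shape_feats: torch.Tensor, texture_feats: torch.Tensor) -> torch.Tensor:
    # Both inputs: (batch, num_layers, dim), where num_layers is the preset layer count.
    num_layers = shape_feats.shape[1]
    half = num_layers // 2
    low_level = shape_feats[:, :half]      # low-level layers carry the garment shape
    high_level = texture_feats[:, half:]   # high-level layers carry the texture
    return torch.cat([low_level, high_level], dim=1)  # stitched features, preset layer count
```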
6. The method of claim 1, wherein the adjusting the relevant parameters of the shape encoder, the texture encoder, and the pre-trained model based on the difference between the original image and the virtual apparel image comprises:
calculating a 2D distance loss between the original image and the virtual apparel image;
calculating a perceptual loss between the original image and the virtual apparel image;
taking a weighted sum of the 2D distance loss and the perceptual loss as a first loss value;
and if the first loss value is greater than or equal to a first preset threshold, adjusting relevant parameters of the shape encoder, the texture encoder and the pre-training model.
7. The method of claim 6, further comprising:
performing image semantic segmentation on the virtual clothing image to obtain a segmentation mask map;
calculating an L1 distance loss between the segmentation mask map and the shape mask image;
taking a weighted sum of the L1 distance loss, the 2D distance loss, and the perceptual loss as a second loss value;
and if the second loss value is greater than or equal to a second preset threshold, adjusting relevant parameters of the shape encoder, the texture encoder and the pre-training model.
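A hedged sketch of the loss values in claims 6 and 7, assuming PyTorch. The weights, the L1 form of the 2D distance loss, and the helpers vgg_features (perceptual feature extractor) and segment_fn (semantic segmentation) are assumptions introduced here for clarity, not terms fixed by the claims.

```python
import torch.nn.functional as F

def loss_values(original_image, virtual_image, shape_mask, vgg_features, segment_fn,
                w_dist=1.0, w_perc=0.8, w_mask=1.0):
    dist_loss = F.l1_loss(virtual_image, original_image)     # 2D distance loss
    perc_loss = F.l1_loss(vgg_features(virtual_image),
                          vgg_features(original_image))      # perceptual loss
    first_loss = w_dist * dist_loss + w_perc * perc_loss     # first loss value (claim 6)

    segmentation_mask = segment_fn(virtual_image)            # segmentation mask map
    mask_loss = F.l1_loss(segmentation_mask, shape_mask)     # L1 distance loss
    second_loss = first_loss + w_mask * mask_loss            # second loss value (claim 7)

    return first_loss, second_loss

# Parameters are adjusted only while a loss value is at or above its preset threshold,
# e.g. updates can stop once second_loss < second_threshold.
```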
8. A method of generating an image of apparel, comprising:
acquiring a shape image and a texture image;
inputting the shape image and the texture image into a clothing generation model generated by the method according to any one of claims 1 to 7, and generating a clothing image of a specified style.
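An illustrative inference sketch for claim 8, reusing the stitching rule sketched under claim 5 (passed in as stitch_fn, an assumption): a new shape image and texture image are encoded, stitched, and decoded into a clothing image of the specified style.

```python
import torch

@torch.no_grad()
def generate_apparel_image(shape_encoder, texture_encoder, pretrained_generator,
                           stitch_fn, shape_image, texture_image):
    shape_feats = shape_encoder(shape_image)
    texture_feats = texture_encoder(texture_image)
    stitched = stitch_fn(shape_feats, texture_feats)
    return pretrained_generator(stitched)   # clothing image of the specified style
```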
9. A training apparatus for a clothing generation model, comprising:
an acquisition unit configured to acquire a sample image of clothing, wherein the sample image includes an original image, a shape mask image, and a texture image;
a first training unit configured to train a first generation model based on the shape mask image, wherein the first generation model comprises a pre-training model and a shape encoder;
a second training unit configured to train a second generative model based on the texture image, wherein the second generative model comprises a pre-trained model and a texture encoder;
a stitching unit configured to obtain shape features and texture features of the sample image through the shape encoder and the texture encoder respectively, and to select a part of each of the shape features and the texture features to stitch into stitched features;
a prediction unit configured to obtain a virtual clothing image through the pre-training model based on the stitched features;
a third training unit configured to adjust relevant parameters of the shape encoder, the texture encoder, and the pre-trained model based on a difference between the original image and the virtual apparel image;
an output unit configured to obtain a clothing generation model based on the adjusted shape encoder, texture encoder and pre-training model.
10. The apparatus of claim 9, wherein the first training unit is further configured to:
inputting the shape mask image into a shape encoder in the first generation model to obtain shape features;
inputting the shape features into a pre-training model in the first generation model to generate a predicted clothing image;
performing image semantic segmentation on the predicted clothing image to obtain a segmentation mask map;
adjusting relevant parameters of the shape encoder based on a shape loss between the segmentation mask map and the shape mask image, wherein the relevant parameters of the pre-trained model are fixed during the training of the shape encoder.
11. The apparatus of claim 10, wherein the first training unit is further configured to:
inputting the predicted clothing image into a discriminator of the pre-training model to obtain a discrimination result;
and adjusting relevant parameters of the shape encoder based on the discrimination result.
12. The apparatus of claim 9, wherein the second training unit is further configured to:
inputting the texture image into a texture encoder in the second generation model to obtain texture features;
inputting the texture features into a pre-training model in the second generation model to generate a predicted clothing image;
and adjusting relevant parameters of the texture encoder based on a 2D distance loss and a perceptual loss between the predicted clothing image and the texture image, wherein the relevant parameters of the pre-training model are fixed in the process of training the texture encoder.
13. The apparatus of claim 9, wherein the stitching unit is further configured to:
selecting, from the shape features output by the shape encoder, low-level features accounting for one half of a preset number of layers;
selecting, from the texture features output by the texture encoder, high-level features accounting for the other half of the preset number of layers;
and stitching the low-level features and the high-level features into stitched features having the preset number of layers.
14. The apparatus of claim 9, wherein the third training unit is further configured to:
calculating a 2D distance loss between the original image and the virtual apparel image;
calculating a perceptual loss between the original image and the virtual apparel image;
taking a weighted sum of the 2D distance loss and the perceptual loss as a first loss value;
and if the first loss value is greater than or equal to a first preset threshold value, adjusting relevant parameters of the shape encoder, the texture encoder and the pre-training model.
15. The apparatus of claim 14, wherein the third training unit is further configured to:
performing image semantic segmentation on the virtual clothing image to obtain a segmentation mask map;
calculating an L1 distance loss between the segmentation mask map and the shape mask image;
taking a weighted sum of the L1 distance loss, the 2D distance loss, and the perceptual loss as a second loss value;
and if the second loss value is greater than or equal to a second preset threshold, adjusting relevant parameters of the shape encoder, the texture encoder and the pre-training model.
16. An apparatus for generating an image of apparel, comprising:
an acquisition unit configured to acquire a shape image and a texture image;
a generating unit configured to input the shape image and the texture image into a clothing generation model generated by the apparatus according to any one of claims 9 to 15, and generate a clothing image of a specified style.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-8.
19. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-8.
CN202210767948.4A 2022-06-30 2022-06-30 Training of clothing generation model and method and device for generating clothing image Active CN115147526B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210767948.4A CN115147526B (en) 2022-06-30 2022-06-30 Training of clothing generation model and method and device for generating clothing image

Publications (2)

Publication Number Publication Date
CN115147526A 2022-10-04
CN115147526B CN115147526B (en) 2023-09-26

Family

ID=83409363

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210767948.4A Active CN115147526B (en) 2022-06-30 2022-06-30 Training of clothing generation model and method and device for generating clothing image

Country Status (1)

Country Link
CN (1) CN115147526B (en)

Patent Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10679046B1 (en) * 2016-11-29 2020-06-09 MAX-PLANCK-Gesellschaft zur Förderung der Wissenschaften e.V. Machine learning systems and methods of estimating body shape from images
CN107613228A (en) * 2017-09-11 2018-01-19 广东欧珀移动通信有限公司 The adding method and terminal device of virtual dress ornament
WO2020056769A1 (en) * 2018-09-21 2020-03-26 Intel Corporation Method and system of facial resolution upsampling for image processing
WO2020174215A1 (en) * 2019-02-25 2020-09-03 Huawei Technologies Co., Ltd. Joint shape and texture decoders for three-dimensional rendering
CN110210523A (en) * 2019-05-13 2019-09-06 山东大学 A kind of model based on shape constraint diagram wears clothing image generating method and device
WO2021173489A1 (en) * 2020-02-28 2021-09-02 Nokia Technologies Oy Apparatus, method, and system for providing a three-dimensional texture using uv representation
WO2021179205A1 (en) * 2020-03-11 2021-09-16 深圳先进技术研究院 Medical image segmentation method, medical image segmentation apparatus and terminal device
CN112102477A (en) * 2020-09-15 2020-12-18 腾讯科技(深圳)有限公司 Three-dimensional model reconstruction method and device, computer equipment and storage medium
US20220157016A1 (en) * 2020-11-17 2022-05-19 International Institute Of Information Technology, Hyderabad System and method for automatically reconstructing 3d model of an object using machine learning model
WO2022127333A1 (en) * 2020-12-16 2022-06-23 腾讯科技(深圳)有限公司 Training method and apparatus for image segmentation model, image segmentation method and apparatus, and device
CN113780326A (en) * 2021-03-02 2021-12-10 北京沃东天骏信息技术有限公司 Image processing method and device, storage medium and electronic equipment
CN112861884A (en) * 2021-03-19 2021-05-28 电子科技大学 Clothing image appearance attribute modification method based on deep learning
CN113393550A (en) * 2021-06-15 2021-09-14 杭州电子科技大学 Fashion garment design synthesis method guided by postures and textures
CN114022809A (en) * 2021-10-28 2022-02-08 三峡大学 Video motion amplification method based on improved self-coding network
CN113971751A (en) * 2021-10-28 2022-01-25 北京百度网讯科技有限公司 Training feature extraction model, and method and device for detecting similar images
CN114140603A (en) * 2021-12-08 2022-03-04 北京百度网讯科技有限公司 Training method of virtual image generation model and virtual image generation method
CN114240954A (en) * 2021-12-16 2022-03-25 推想医疗科技股份有限公司 Network model training method and device and image segmentation method and device
CN114549728A (en) * 2022-03-25 2022-05-27 北京百度网讯科技有限公司 Training method of image processing model, image processing method, device and medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SHUAI YANG et al.: "ShapeEditer: a StyleGAN Encoder for Face Swapping", Computer Science, pages 1-13 *
张姝洁; 郑利平; 韩清; 张晗: "From Pictures to Clothes: A Fast Method for Generating Virtual Character Clothing" (从图片到衣服: 快速虚拟角色服饰生成方法), Computer Engineering and Applications (计算机工程与应用), no. 07, pages 240-245 *
王志伟; 普园媛; 王鑫; 赵征鹏; 徐丹; 钱文华: "Accurate Multi-Scale Clothing Image Retrieval Based on Multi-Feature Fusion" (基于多特征融合的多尺度服装图像精准化检索), Chinese Journal of Computers (计算机学报), no. 04, pages 156-170 *

Also Published As

Publication number Publication date
CN115147526B (en) 2023-09-26

Similar Documents

Publication Publication Date Title
CN111787242B (en) Method and apparatus for virtual fitting
CN111832745B (en) Data augmentation method and device and electronic equipment
CN113971751A (en) Training feature extraction model, and method and device for detecting similar images
CN109409994A (en) The methods, devices and systems of analog subscriber garments worn ornaments
CN113362263A (en) Method, apparatus, medium, and program product for changing the image of a virtual idol
CN112562045B (en) Method, apparatus, device and storage medium for generating model and generating 3D animation
CN111768356A (en) Face image fusion method and device, electronic equipment and storage medium
CN111861867A (en) Image background blurring method and device
CN110427915A (en) Method and apparatus for output information
CN116071619A (en) Training method of virtual fitting model, virtual fitting method and electronic equipment
CN114140320B (en) Image migration method and training method and device of image migration model
CN115375823A (en) Three-dimensional virtual clothing generation method, device, equipment and storage medium
JP2023131117A (en) Joint perception model training, joint perception method, device, and medium
CN115100334A (en) Image edge drawing and animation method, device and storage medium
CN109829520A (en) Image processing method and device
CN113962845A (en) Image processing method, image processing apparatus, electronic device, and storage medium
CN116843807A (en) Virtual image generation method, virtual image model training method, virtual image generation device, virtual image model training device and electronic equipment
CN110738261A (en) Image classification and model training method and device, electronic equipment and storage medium
CN115147526B (en) Training of clothing generation model and method and device for generating clothing image
CN110334237A (en) A kind of solid object search method and system based on multi-modal data
CN115409951A (en) Image processing method, image processing device, electronic equipment and storage medium
CN115147508B (en) Training of clothing generation model and method and device for generating clothing image
CN115147681B (en) Training of clothing generation model and method and device for generating clothing image
CN114758391B (en) Hair style image determining method, device, electronic equipment, storage medium and product
CN116385643B (en) Virtual image generation method, virtual image model training method, virtual image generation device, virtual image model training device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant