CN112232914B - Four-stage virtual fitting method and device based on 2D image

Info

Publication number
CN112232914B
Authority
CN
China
Prior art keywords
image
clothes
fitting
distorted
arm
Prior art date
Legal status
Active
Application number
CN202011116951.7A
Other languages
Chinese (zh)
Other versions
CN112232914A (en)
Inventor
彭涛
常源
刘军平
胡新荣
何儒汉
张俊杰
张自力
姜明华
Current Assignee
Wuhan Textile University
Original Assignee
Wuhan Textile University
Priority date
Filing date
Publication date
Application filed by Wuhan Textile University
Priority to CN202011116951.7A
Publication of CN112232914A
Application granted
Publication of CN112232914B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G06Q30/0641 Shopping interfaces
    • G06Q30/0643 Graphical representation of items or shoppers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a four-stage virtual fitting method and device based on a 2D image, which comprise the following steps: acquiring a reference person image and a target clothes image; extracting a human body part semantic segmentation map and a pose map from the reference person image, and fusing the clothes region and the arm region in the human body part semantic segmentation map to obtain a clothes-independent fusion map; warping the target clothes image to obtain a warped clothes image; predicting, from the clothes-independent fusion map and the warped clothes image, the semantic segmentation map of the target clothes in the image of the reference person wearing the target clothes; generating an arm image from the predicted semantic segmentation map and the pose map; and generating a try-on composite image from the warped clothes image, the predicted semantic segmentation map and the arm image, thereby completing the four-stage virtual fitting based on the 2D image. The method solves the problem in the prior art that misalignment artifacts and the like seriously degrade image visual quality, and improves the realism of the fitting effect.

Description

Four-stage virtual fitting method and device based on 2D image
Technical Field
The invention relates to the technical field of computers and networks, in particular to a four-stage virtual fitting method and a four-stage virtual fitting device.
Background
With the development of computer technology and online shopping platforms, online clothing shopping has grown remarkably. Compared with traditional shopping, online shopping has many advantages such as convenience and speed; however, because online clothing shopping does not allow customers to try clothes on in real time and inspect them before buying, as they can when shopping in a physical store, some people still choose to buy clothing in physical stores to avoid repeated returns. It is therefore increasingly important to establish an interactive shopping environment close to reality, and virtual fitting has accordingly attracted wide attention.
Early virtual fitting was mainly realized with 3D methods, but these need to rely on computer graphics to build 3D models and render fitting images, and 3D data is difficult to obtain, so a large amount of manpower, material and financial resources is required, which greatly limits the practical application of virtual fitting. Image-based virtual fitting has recently been proposed, converting virtual fitting into a conditional image generation problem and showing encouraging results. At present, many organizations at home and abroad are conducting related research, but most of the attention is paid to how to better generate the details of the target clothes, while the correspondence between the target clothes and the reference person is ignored, so the generated try-on composite image suffers from misalignment artifacts that seriously affect image visual quality; meanwhile, the retention of other try-on-irrelevant details is ignored, so the try-on composite image is blurred or even occluded by clothes, which seriously affects the fitting effect.
Disclosure of Invention
The invention aims to provide a four-stage virtual fitting method and device based on a 2D image, which effectively solve the problem in existing virtual fitting that misalignment artifacts and the like in the synthesized try-on image seriously affect image visual quality.
The technical scheme provided by the invention is as follows:
a four-stage virtual fitting method based on 2D images comprises the following steps:
acquiring a reference person image and a target clothes image;
extracting a human body part semantic segmentation graph and a posture graph from the reference human image, and fusing a clothes area and an arm area in the human body part semantic segmentation graph to obtain a clothes irrelevant fusion graph;
warping the target clothes image to obtain a warped clothes image;
predicting a semantic segmentation map of the target clothes in the image of the target clothes worn by the reference person according to the clothes-independent fusion map and the distorted clothes image;
generating an arm image according to the predicted semantic segmentation graph and the gesture graph;
and generating a fitting composite image according to the distorted clothes image, the predicted semantic segmentation image and the arm image, and finishing the four-stage virtual fitting based on the 2D image.
The invention also provides a four-stage virtual fitting device based on the 2D image, which comprises the following components:
the image acquisition module is used for acquiring a reference person image and a target clothes image;
the clothes irrelevant fusion image generation module is used for extracting a human body part semantic segmentation image and a posture image from the reference person image and fusing a clothes area and an arm area in the human body part semantic segmentation image to obtain a clothes irrelevant fusion image;
the clothes distortion module is used for distorting the target clothes image to obtain a distorted clothes image;
the semantic segmentation map generation module is used for predicting a semantic segmentation map of the target clothes in the target clothes image worn by the reference person according to the clothes-independent fusion map and the distorted clothes image;
the arm image generation module is used for generating an arm image according to the predicted semantic segmentation graph and the gesture graph;
and the fitting synthesis module is used for generating a fitting synthesis image according to the distorted clothes image, the predicted semantic segmentation image and the arm image so as to complete four-stage virtual fitting based on the 2D image.
The invention also provides terminal equipment which comprises a memory, a processor and a computer program which is stored in the memory and can run on the processor, wherein the processor realizes the steps of the four-stage virtual fitting method based on the 2D image when running the computer program.
The invention also provides a computer readable storage medium, which stores a computer program, and the computer program is executed by a processor to realize the steps of any one of the four-stage virtual fitting method based on the 2D image.
The four-stage virtual fitting method and device based on the 2D image, provided by the invention, can at least bring the following beneficial effects:
1. The semantic segmentation map of the target clothes is predicted using the warped clothes image and the clothes-independent fusion map extracted from the reference person image, so that the predicted semantic segmentation map can guide the generation of the arm image that changes after fitting and guide the synthesis of the try-on image. This solves the problem in the prior art that misalignment artifacts and the like seriously affect image visual quality, and improves the realism of the fitting effect.
2. The palm image extracted from the original reference person image according to the joint points, together with the arm mask extracted from the predicted semantic segmentation map, is used to generate the arm image that changes after fitting, so that the generated try-on composite image contains a correct and complete arm and the details of the non-try-on regions of the reference person are fully preserved; the fitting effect is closer to reality and the user experience is greatly improved.
3. A strategy combining pixel warping and feature warping is used in the clothes warping process, which generates a more natural and realistic warped clothes shape and texture, improves robustness to deformation, rotation and occlusion, and solves the shape and texture distortion problem that occurs when pixel warping is used alone.
4. For the extraction of the non-try-on-region detail map, not only the face and hair images but also the other regions outside the try-on region are considered; for example, when a jacket is tried on, the trousers region is treated as a non-try-on region, so the details of the non-try-on clothing regions are fully preserved and the effect is further improved.
Drawings
The foregoing features, technical features, advantages and implementations of the invention will be further described, in a clearly understandable manner, in the following detailed description of preferred embodiments in conjunction with the accompanying drawings.
FIG. 1 is a schematic flow chart of a four-stage virtual fitting method based on 2D images according to the present invention;
FIG. 2 is a schematic view illustrating a pixel warping process performed on a target clothes image according to the present invention;
FIG. 3 is a schematic view illustrating a process of simultaneously performing pixel warping and feature warping on a target garment image according to the present invention;
FIG. 4 is a flow chart of semantic segmentation graph generation in the present invention;
FIG. 5 is a flowchart illustrating the generation of an arm image according to the present invention;
FIG. 6 is a flowchart of a try-on composite image generation process of the present invention;
FIG. 7 is a schematic structural diagram of a four-stage virtual fitting device based on 2D images according to the present invention;
FIG. 8 is a comparison of fitting results based on the integrity of the resulting garment in accordance with an embodiment of the present invention;
FIG. 9 is a comparison graph of fitting effects based on the integrity of the generated arms in an example of the present invention;
FIG. 10 is a comparison graph of the effect based on the degree of fit of the garment/visual quality of the overall fit composite image in one embodiment of the present invention;
fig. 11 is a schematic structural diagram of a terminal device in the present invention.
The reference numbers illustrate:
100-a four-stage virtual fitting device based on a 2D image, 110-an image acquisition module, 120-a clothes irrelevant fusion image generation module, 130-a clothes distortion module, 140-a semantic segmentation image generation module, 150-an arm image generation module and 160-a fitting synthesis module.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, specific embodiments of the present invention will be described below with reference to the accompanying drawings. It is to be understood that the drawings in the following description are merely exemplary of the invention and that other drawings and embodiments may be devised by those skilled in the art without the use of inventive faculty.
Fig. 1 shows a four-stage virtual fitting method based on 2D images, which includes:
s10, acquiring a reference person image and a target clothes image;
s20, extracting a human body part semantic segmentation graph and a posture graph from the reference human image, and fusing a clothes area and an arm area in the human body part semantic segmentation graph to obtain a clothes irrelevant fusion graph;
s30, warping the target clothes image to obtain a warped clothes image;
s40, predicting a semantic segmentation graph of the target clothes in the target clothes image worn by the reference person according to the clothes irrelevant fusion graph and the distorted clothes image;
s50, generating an arm image according to the predicted semantic segmentation image and the gesture image;
s60, generating a fitting composite image according to the distorted clothes image, the predicted semantic segmentation image and the arm image, and finishing four-stage virtual fitting based on the 2D image.
The reference person image and the target clothes image may be obtained from a dedicated picture website or a clothing shopping website, which is not specifically limited here. The reference person image is an image of the person who needs to try on the target clothes; to achieve a better try-on effect, the image should retain the frontal features of the person as completely as possible. The target clothes image is an image of the clothes the person in the reference person image needs to try on, and should retain the texture, shape and other characteristics of the clothes as much as possible. Before virtual fitting, a neural network needs to be built and trained according to requirements, and the data set includes a training part and a testing part. In both training and testing, the inputs are a reference person image and a target clothes image; during training the target clothes image may be the clothes image corresponding to the clothes worn by the person in the reference person image, and during testing it is randomly selected. The images are processed to a uniform size before training and testing.
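As a concrete illustration of this preprocessing, the following sketch pairs a reference person image with a target clothes image and resizes both to a uniform size before training or testing. It assumes PyTorch and torchvision; the class name, image size and normalization are illustrative choices, not prescribed by the patent.

```python
import torch
from torch.utils.data import Dataset
from torchvision import transforms
from PIL import Image

class TryOnPairDataset(Dataset):
    """Pairs a reference person image with a target clothes image.

    During training the clothes image matches the one worn in the person
    image; during testing it is sampled at random (per the method above).
    """
    def __init__(self, person_paths, clothes_paths, size=(256, 192)):
        assert len(person_paths) == len(clothes_paths)
        self.person_paths = person_paths
        self.clothes_paths = clothes_paths
        # Resize every image to a uniform size and normalize to [-1, 1].
        self.tf = transforms.Compose([
            transforms.Resize(size),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),
        ])

    def __len__(self):
        return len(self.person_paths)

    def __getitem__(self, idx):
        person = self.tf(Image.open(self.person_paths[idx]).convert("RGB"))
        clothes = self.tf(Image.open(self.clothes_paths[idx]).convert("RGB"))
        return {"person": person, "clothes": clothes}
```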
After the reference person image and the target clothes image are acquired, they are processed to generate a reference person representation, which involves the following steps: extracting a human body part semantic segmentation map and a pose map from the reference person image; obtaining a coarse body shape map and a non-try-on-region detail map from the human body part semantic segmentation map and the reference person image; and synthesizing the coarse body shape map, the pose map and the non-try-on-region detail map into the reference person representation.
In this process, a pose estimation and extraction method is first used to extract the key points of the person in the reference person image to obtain the pose map, and a semantic segmentation algorithm is used to semantically segment the reference person image to obtain the human body part semantic segmentation map. Then the background region labels of the human body part semantic segmentation map are used to process the reference person image to obtain the coarse body shape map. Next, the face, hair and trousers region labels of the human body part semantic segmentation map are used to process the reference person image to obtain the non-try-on-region detail map (the complete non-try-on region comprises the face, hair, trousers and arm regions, and only the images of the face, hair and trousers regions are extracted as the non-try-on-region detail map, i.e. the non-try-on regions other than the arm region). Finally, the obtained pose map, coarse body shape map and non-try-on-region detail map are concatenated on the channel dimension to form the reference person representation. Here, the pose estimation extraction algorithm and the semantic segmentation algorithm may be selected according to actual requirements and are not specifically limited. For example, in one example, OpenPose (a pose estimator) is used to perform pose estimation on the reference person image, obtaining a pose map with 18 key points (including hair, left eye, right eye, left eyebrow, right eyebrow, nose, left shoulder, right shoulder, left hand, right hand and the like); LIP (Look into Person: Self-supervised Structure-sensitive Learning and A New Benchmark for Human Parsing, a human parser) is used to semantically segment the reference person image, obtaining a human body part semantic segmentation map with 20 labels including the background. The reference person representation, concatenated on the channel dimension from the obtained pose map, coarse body shape map and non-try-on-region detail map, is thus composed of 22 channels. The extraction of the non-try-on-region detail map considers not only the face and hair images but also the other regions outside the try-on region, so as to fully preserve the details of the non-try-on clothing regions, which helps improve the fitting effect and bring it closer to reality.
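To make the channel bookkeeping concrete, the minimal sketch below (assuming PyTorch tensors; the function name is illustrative) assembles the 22-channel reference person representation from an 18-channel pose map, a 1-channel coarse body shape map and a 3-channel non-try-on-region detail map.

```python
import torch

def build_person_representation(pose_heatmaps: torch.Tensor,
                                coarse_shape: torch.Tensor,
                                detail_rgb: torch.Tensor) -> torch.Tensor:
    """Concatenate pose (18 ch), coarse body shape (1 ch) and the
    non-try-on-region detail image (3 ch) along the channel dimension,
    giving the 22-channel representation described above."""
    assert pose_heatmaps.shape[1] == 18
    assert coarse_shape.shape[1] == 1
    assert detail_rgb.shape[1] == 3
    return torch.cat([pose_heatmaps, coarse_shape, detail_rgb], dim=1)

# Example with batch size 1 and 256x192 images:
p = build_person_representation(torch.zeros(1, 18, 256, 192),
                                torch.zeros(1, 1, 256, 192),
                                torch.zeros(1, 3, 256, 192))
print(p.shape)  # torch.Size([1, 22, 256, 192])
```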
After the reference person representation is synthesized, the target clothes image is pixel-warped according to the person representation, as shown in fig. 2, which includes: passing the reference person representation P and the target clothes image C respectively into two twin convolutional neural networks W1 with non-shared parameters to extract features, the two twin convolutional neural networks having the same structure; passing the features of the reference person representation and the target clothes image into a regression network W2 to predict the spatial transformation parameters θ; and warping the pixels of the target clothes image according to the spatial transformation parameters to obtain the pixel-warped clothes image.
Specifically, before the clothes image is pixel-warped, the reference person representation P and the target clothes image C are respectively passed into the two twin convolutional neural networks W1 with non-shared parameters to extract features, and the spatial transformation parameters θ are predicted from the extracted features. Here, the two feature-extraction networks have the same structure; for example, in one example each twin convolutional neural network contains four down-sampling convolutional layers of stride 2 and two convolutional layers of stride 1. After the two networks W1 have extracted their features, the extracted features are combined by matrix multiplication and passed into the regression network W2 (comprising two convolutional layers of stride 2, two convolutional layers of stride 1 and a fully connected layer), and finally a tanh activation function is used to obtain the spatial transformation parameters, so that the pixels of the target clothes image are warped with these parameters to obtain the pixel-warped clothes image.
In the subsequent step, the warped clothes image C_W and the clothes-independent fusion map are combined to predict the semantic segmentation map of the target clothes in the image of the reference person wearing the target clothes.
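A compact sketch of this geometric matching stage is given below, assuming PyTorch. The layer counts and strides follow the example in the text; the feature sizes, channel widths and the number of transformation parameters are assumptions, and the affine warp at the end is a simplified, runnable stand-in for the TPS transformation referred to in the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_ch, out_ch, stride):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride, 1),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class FeatureExtractor(nn.Module):
    """Four stride-2 down-sampling convolutions followed by two stride-1
    convolutions, as in the example above (one of the twin networks W1)."""
    def __init__(self, in_ch):
        super().__init__()
        chs = [64, 128, 256, 512]
        layers, prev = [], in_ch
        for c in chs:
            layers.append(conv_block(prev, c, stride=2))
            prev = c
        layers += [conv_block(prev, prev, 1), conv_block(prev, prev, 1)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

class ThetaRegressor(nn.Module):
    """Combines the two feature maps by matrix multiplication and regresses
    the spatial transformation parameters theta with a tanh output (W2)."""
    def __init__(self, feat_hw=(16, 12), n_params=6):
        super().__init__()
        h, w = feat_hw
        self.convs = nn.Sequential(conv_block(h * w, 128, 2),
                                   conv_block(128, 64, 2),
                                   conv_block(64, 32, 1),
                                   conv_block(32, 32, 1))
        self.fc = nn.Linear(32 * (h // 4) * (w // 4), n_params)

    def forward(self, feat_p, feat_c):
        b, c, h, w = feat_p.shape
        # correlation: (B, HW, C) x (B, C, HW) -> (B, HW, HW), viewed as a map
        corr = torch.bmm(feat_p.flatten(2).transpose(1, 2), feat_c.flatten(2))
        corr = corr.view(b, h * w, h, w)
        return torch.tanh(self.fc(self.convs(corr).flatten(1)))  # theta in [-1, 1]

def pixel_warp(person_repr, clothes, extractor_p, extractor_c, regressor):
    """Predict theta and warp the clothes pixels. The affine grid below is a
    simplified stand-in for the TPS grid used in the patent."""
    theta = regressor(extractor_p(person_repr), extractor_c(clothes))
    grid = F.affine_grid(theta.view(-1, 2, 3), clothes.shape, align_corners=False)
    return F.grid_sample(clothes, grid, align_corners=False)

# Usage sketch: a 22-channel person representation and a 3-channel clothes image.
w1_p, w1_c, w2 = FeatureExtractor(22), FeatureExtractor(3), ThetaRegressor()
```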
In another embodiment, to generate a more natural and realistic warped clothes shape and texture, in addition to the above pixel warping, the warping of the target clothes image may also be performed on the features of the target clothes image, as shown in fig. 3, which includes: passing the target clothes image C into a convolutional neural network W3 to extract features; warping (TPS transformation) the extracted features of the target clothes image according to the spatial transformation parameters; and passing the warped features into a deconvolution neural network W4 corresponding in structure to the convolutional neural network, obtaining a feature-warped clothes image and predicting a warped-clothes composite mask M_C.
In one example, after feature extraction is performed on the target clothes image with five convolutional layers of kernel size 3 and stride 1, the feature maps are sampled with five sampling networks of the same size as the feature maps extracted by the convolutional layers, thereby realizing feature warping. Then the five feature-warped outputs are input into the deconvolution layers corresponding to the five convolutional layers and decoded to generate the feature-warped clothes image and the predicted warped-clothes composite mask (of the 4-channel output of the decoder, the first 3 channels output the feature-warped clothes image and the 4th channel outputs the predicted warped-clothes composite mask). Finally, the warped-clothes composite mask is used to perform an element-wise multiplication between the pixel-warped clothes image and the feature-warped clothes image to obtain the warped clothes image C_W. In the subsequent step, the warped clothes image C_W and the clothes-independent fusion map are combined to predict the semantic segmentation map of the target clothes in the image of the reference person wearing the target clothes.
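The following sketch is one plausible reading of this feature-warping branch in PyTorch. The encoder and decoder layer counts follow the example above; the skip-style fusion in the decoder and the mask-weighted blend of the two warped images are assumptions about details the text leaves open, not a literal transcription of the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureWarpBranch(nn.Module):
    """Encoder (five 3x3 convolutions), feature warping via grid sampling,
    and a mirrored decoder whose 4-channel output holds the feature-warped
    clothes image (3 channels) and the composite mask M_C (1 channel)."""
    def __init__(self, ch=64):
        super().__init__()
        self.enc = nn.ModuleList(
            [nn.Conv2d(3 if i == 0 else ch, ch, 3, 1, 1) for i in range(5)])
        self.dec = nn.ModuleList(
            [nn.ConvTranspose2d(ch, ch if i < 4 else 4, 3, 1, 1) for i in range(5)])

    def forward(self, clothes, grid):
        # `grid` is the sampling grid derived from the spatial transformation
        # parameters (the same transformation used for pixel warping).
        feats, x = [], clothes
        for conv in self.enc:
            x = F.relu(conv(x))
            feats.append(F.grid_sample(x, grid, align_corners=False))
        x = feats[-1]
        for i, deconv in enumerate(self.dec):
            x = deconv(x if i == 0 else x + feats[-1 - i])  # skip-style fusion (assumed)
            if i < 4:
                x = F.relu(x)
        img = torch.tanh(x[:, :3])        # feature-warped clothes image
        mask = torch.sigmoid(x[:, 3:4])   # warped-clothes composite mask M_C
        return img, mask

def combine_warps(pixel_warped, feature_warped, mask):
    # Assumed combination rule: the mask weights the two warped images
    # element-wise to form the final warped clothes image C_W.
    return mask * pixel_warped + (1.0 - mask) * feature_warped
```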
In this embodiment, a strategy combining pixel warping and feature warping is used in the clothes warping process, which generates a more natural and realistic warped clothes shape and texture, improves robustness to deformation, rotation and occlusion, and solves the shape and texture distortion problem that occurs when pixel warping is used alone.
Then, the clothes region and the arm region in the human body part semantic segmentation map are fused to obtain the clothes-independent fusion map, and the semantic segmentation map M_P of the target clothes in the image of the reference person wearing the target clothes is predicted from the clothes-independent fusion map M_f and the warped clothes image C_W, as shown in fig. 4. Specifically, the clothes region and the arm region in the human body part semantic segmentation map are fused into one indivisible region to obtain the clothes-independent fusion map M_f; then the clothes-independent fusion map M_f and the warped clothes image C_W are concatenated as the input of a semantic segmentation network W5, which predicts the semantic segmentation map M_P. Here, the semantic segmentation map M_P is the semantic segmentation map of the reference person wearing the target clothes; it helps the subsequent arm generation step to synthesize a new arm image and helps synthesize the pixels of each body part during try-on synthesis. In practical applications, a standard U-Net network may be used to predict the semantic segmentation map, or any other network structure capable of achieving this purpose, such as other U-Net variants, may be used, which is not specifically limited here. When a standard U-Net network is used, the two inputs are the warped clothes image and the clothes-independent fusion map respectively, and the output is the predicted semantic segmentation map. It should be clear that the order of the steps of generating the clothes-independent fusion map and warping the target clothes image can be adjusted according to the actual situation, and is not specifically limited here.
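As an illustration of this stage, the sketch below (PyTorch; the label indices and the use of a one-hot fusion map are hypothetical, and the U-Net backbone is left as an injected dependency) merges the clothes and arm labels of the parsing map into a single region and feeds the fusion map together with the warped clothes image to a segmentation network.

```python
import torch
import torch.nn as nn

# Hypothetical LIP-style label indices for the regions to merge.
CLOTHES_LABELS = [5]        # upper clothes
ARM_LABELS = [14, 15]       # left arm, right arm
FUSED_LABEL = 5

def clothes_independent_fusion(parsing: torch.Tensor) -> torch.Tensor:
    """Fuse the clothes and arm regions of the human parsing map into one
    indivisible region, producing the clothes-independent fusion map M_f."""
    fused = parsing.clone()
    for lab in CLOTHES_LABELS + ARM_LABELS:
        fused[fused == lab] = FUSED_LABEL
    return fused

class SegPredictor(nn.Module):
    """Wraps any U-Net-like backbone (W5) that maps the concatenated inputs
    (warped clothes, 3 ch + one-hot fusion map, 20 ch) to 20 label logits."""
    def __init__(self, unet: nn.Module):
        super().__init__()
        self.unet = unet

    def forward(self, warped_clothes, fusion_onehot):
        x = torch.cat([warped_clothes, fusion_onehot], dim=1)
        return self.unet(x)   # predicted semantic segmentation map M_P (logits)
```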
After the semantic segmentation map is generated, the step of generating an arm image from the predicted semantic segmentation map and the pose map is performed, as shown in fig. 5, which includes: extracting an arm mask M_a from the semantic segmentation map; extracting a palm image I_hand from the reference person image according to the palm key points in the pose map; and generating an arm image I_arm using the palm image and the arm mask. Specifically, the palm image I_hand is extracted from the original reference person image according to the palm joint points in the pose map, the arm mask M_a is extracted from the output of the semantic segmentation network, and the two images are input into the arm generation network W6 to generate the new arm image I_arm. Similarly, in practical applications, a standard U-Net network may be used to predict the arm image, or any other network structure capable of achieving this purpose, such as other U-Net variants, which is not specifically limited here. When a standard U-Net network is used, the two inputs are the palm image and the arm mask respectively, and the output is the generated arm image.
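The arm-generation inputs can be prepared roughly as follows. This is a sketch assuming PyTorch; the arm label indices, the square palm crop and its radius are illustrative assumptions, and arm_net stands for any U-Net-style generator W6.

```python
import torch

ARM_LABELS = (14, 15)  # hypothetical left/right arm labels in M_P

def extract_arm_mask(pred_parsing: torch.Tensor) -> torch.Tensor:
    """Arm mask M_a: 1 where the predicted segmentation is an arm label."""
    mask = torch.zeros_like(pred_parsing, dtype=torch.float32)
    for lab in ARM_LABELS:
        mask[pred_parsing == lab] = 1.0
    return mask

def extract_palm_image(person: torch.Tensor, palm_xy, radius: int = 20) -> torch.Tensor:
    """Palm image I_hand: keep a square patch around each palm keypoint of the
    original reference person image, zero elsewhere."""
    out = torch.zeros_like(person)
    _, _, h, w = person.shape
    for (x, y) in palm_xy:                     # palm keypoints from the pose map
        x0, x1 = max(0, x - radius), min(w, x + radius)
        y0, y1 = max(0, y - radius), min(h, y + radius)
        out[:, :, y0:y1, x0:x1] = person[:, :, y0:y1, x0:x1]
    return out

def generate_arm(arm_net, person, palm_xy, pred_parsing):
    arm_mask = extract_arm_mask(pred_parsing)
    palm = extract_palm_image(person, palm_xy)
    # arm_net (W6) takes the palm image and arm mask and outputs I_arm.
    return arm_net(torch.cat([palm, arm_mask], dim=1))
```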
It is known that, in the try-on process, if the generated arm differs from the arm in the reference person image, for example when the reference person wears long sleeves while the target clothes have short sleeves, the generated try-on composite image is very likely to contain errors. This problem is well handled in the arm image generation process: the semantic segmentation map of the reference person wearing the target clothes is first predicted from the clothes-independent fusion map and the warped clothes image, and the arm mask is then extracted from this semantic segmentation map to generate the arm image, so that the complete arm that should be visible after try-on synthesis can be obtained, which greatly improves the realism of the fitting effect and the visual quality of the image.
After the arm image is generated, the try-on composite image is generated from the warped clothes image, the predicted semantic segmentation map, the arm image and the detail map of other non-try-on regions, as shown in fig. 6, which includes: passing the semantic segmentation map M_P, the extracted non-try-on-region details, the warped clothes image C_W and the arm image I_arm into the try-on synthesis network W7 to obtain a preliminary try-on composite image I_P and a predicted try-on image synthesis mask M_com; then using the try-on image synthesis mask M_com to perform an element-wise multiplication between the preliminary try-on composite image I_P and the warped clothes image C_W, obtaining the final try-on composite image I_f. Similarly, in practical applications, a standard U-Net network may be used here, or any other network structure capable of achieving this purpose, such as other U-Net variants, which is not specifically limited here. When a standard U-Net network is used, the four inputs are the semantic segmentation map M_P, the extracted non-try-on-region details, the warped clothes image C_W and the arm image I_arm, and the outputs are the preliminary try-on composite image I_P and the predicted try-on image synthesis mask M_com.
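A minimal sketch of this final stage is shown below (PyTorch). The mask-weighted composition follows a common CP-VTON-style reading of the element-wise multiplication described above; it is an interpretation rather than a literal quotation of the patent, and the U-Net backbone is again an injected dependency.

```python
import torch
import torch.nn as nn

class TryOnSynthesis(nn.Module):
    """Try-on synthesis network W7: a U-Net-like backbone whose output is
    split into the preliminary try-on image I_P (3 ch) and the synthesis
    mask M_com (1 ch)."""
    def __init__(self, unet: nn.Module):
        super().__init__()
        self.unet = unet

    def forward(self, seg_map, non_tryon_details, warped_clothes, arm_image):
        x = torch.cat([seg_map, non_tryon_details, warped_clothes, arm_image], dim=1)
        out = self.unet(x)
        prelim = torch.tanh(out[:, :3])      # preliminary try-on composite image I_P
        mask = torch.sigmoid(out[:, 3:4])    # try-on image synthesis mask M_com
        # Assumed composition: the mask selects warped-clothes pixels and keeps
        # the preliminary rendering elsewhere, giving the final image I_f.
        final = mask * warped_clothes + (1.0 - mask) * prelim
        return final, prelim, mask
```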
The present invention also provides a four-stage virtual fitting apparatus 100 based on a 2D image, as shown in fig. 7, including: an image acquisition module 110 for acquiring a reference person image and a target clothes image; a clothing-independent fusion image generation module 120, configured to extract a human body part semantic segmentation image and a posture image from the reference person image, and fuse a clothing region and an arm region in the human body part semantic segmentation image to obtain a clothing-independent fusion image; a clothing distortion module 130 for distorting the target clothing image to obtain a distorted clothing image; a semantic segmentation map generation module 140, configured to predict a semantic segmentation map of the target garment worn by the reference person in the target garment image according to the garment-independent fusion map and the distorted garment image; an arm image generation module 150, configured to generate an arm image according to the predicted semantic segmentation map and the pose map; and the fitting synthesis module 160 is configured to generate a fitting synthesis image according to the distorted clothes image, the predicted semantic segmentation map, and the arm image, and complete four-stage virtual fitting based on the 2D image.
Specifically, the image acquisition module may obtain the reference person image and the target clothes image from a dedicated picture website or a clothing shopping website. The reference person image is an image of the person who needs to try on the target clothes; to achieve a better try-on effect, the image should retain the frontal features of the person as completely as possible. The target clothes image is an image of the clothes the person in the reference person image needs to try on, and should retain the texture, shape and other characteristics of the clothes as much as possible. Before virtual fitting, a neural network needs to be built and trained according to requirements, and the data set includes a training part and a testing part. In both training and testing, the inputs are a reference person image and a target clothes image; during training the target clothes image may be the clothes image corresponding to the clothes worn by the person in the reference person image, and during testing it is randomly selected. The images are processed to a uniform size before training and testing.
After the image acquisition module 110 obtains the reference person image and the target clothes image, the clothes-independent fusion map generation module 120 processes them to generate a reference person representation; for this it includes: an image extraction unit, configured to obtain a coarse body shape map and a non-try-on-region detail map from the human body part semantic segmentation map and the reference person image, the non-try-on-region detail map being an image of the non-try-on regions other than the arm region; and a reference person representation generation unit, configured to synthesize the reference person representation from the coarse body shape map, the pose map and the non-try-on-region detail map.
in the process, firstly, a pose estimation and extraction method is adopted to extract key points of people in a reference person image to obtain a pose image, and a semantic segmentation algorithm is used to perform semantic segmentation on the reference person image to obtain a human body part semantic segmentation image. Then, the image extraction unit processes the reference person image by using the background region labels of the human body part semantic segmentation image to obtain a rough body shape image. Then, the reference person image is processed by using the face, hair and trousers region labels of the human body part semantic segmentation map to obtain a non-fitting region detail map (the complete non-fitting region comprises a face region, a hair region, a trousers region and an arm region, and only the image of the face region, the hair region and the trousers region is extracted as the non-fitting region detail map, namely other non-fitting regions except the arm region). And finally, the reference person representation generating unit is connected on the channel according to the obtained posture graph, the rough body shape graph and the non-fitting region detail graph to form a reference person representation. Here, the pose estimation extraction algorithm and the semantic segmentation algorithm may be selected according to actual requirements, which are not specifically limited, for example, in an example, openpos (pose estimator) is used to perform pose estimation on a reference person image, so as to obtain a pose graph with 18 key points (including hair, left eye, right eye, left eyebrow, right eyebrow, nose, left shoulder, right shoulder, left hand, right hand, and the like); using LIP (Self-redundant Structure-systematic Learning and A New Benchmark for Human matching, human analyzer) to carry out semantic segmentation on the reference Human image, and obtaining a Human body part semantic segmentation map with 20 labels including the background. Whereby the on-channel connected reference person representation is composed 22 of the obtained pose graph, the rough body shape graph and the non-fitting area detail graph. The extraction of the detail map of the non-fitting area not only considers the images of the face and the hair, but also considers other areas except the fitting area so as to completely reserve the details of the non-fitting clothes area, thereby being beneficial to improving the fitting effect and being closer to reality.
After the reference person representation is synthesized, the target clothes image is pixel-warped according to the person representation in the clothes warping module, which includes: a first feature extraction unit, configured to pass the reference person representation and the target clothes image respectively into two twin convolutional neural networks with non-shared parameters to extract features, the two twin convolutional neural networks having the same structure; a spatial transformation parameter prediction unit, configured to pass the features of the reference person representation and the target clothes image into a regression network to predict the spatial transformation parameters; and a pixel warping unit, configured to warp the pixels of the target clothes image according to the spatial transformation parameters to obtain the pixel-warped clothes image, completing the warping of the target clothes image.
Specifically, before the clothes image is pixel-warped, the reference person representation and the target clothes image are respectively passed into the two twin convolutional neural networks with non-shared parameters to extract features, and the spatial transformation parameters are then predicted from the extracted features. Here, the two feature-extraction networks have the same structure; for example, in one example each twin convolutional neural network includes four down-sampling convolutional layers of stride 2 and two convolutional layers of stride 1. After the two networks have extracted their features, the extracted features are combined by matrix multiplication and passed into the regression network (including two down-sampling convolutional layers of stride 2, two convolutional layers of stride 1 and a fully connected layer), and finally a tanh activation function is used to obtain the spatial transformation parameters, so that the pixels of the target clothes image are warped with these parameters to obtain the pixel-warped clothes image. In the subsequent step, the pixel-warped clothes image and the clothes-independent fusion map are combined to predict the semantic segmentation map of the target clothes in the image of the reference person wearing the target clothes.
In another embodiment, to generate a more natural and realistic warped clothes shape and texture, the features of the target clothes image are warped at the same time as the above pixel warping; the clothes warping module then further includes: a second feature extraction unit, configured to pass the target clothes image into a convolutional neural network to extract features; a feature warping unit, configured to warp the extracted features of the target clothes image according to the spatial transformation parameters; a feature-warped clothes image generation unit, configured to pass the warped features into a deconvolution neural network corresponding in structure to the convolutional neural network, obtaining a feature-warped clothes image and a predicted warped-clothes composite mask; and a warped clothes synthesis unit, configured to synthesize the pixel-warped clothes image and the feature-warped clothes image based on the warped-clothes composite mask to obtain the warped clothes image.
In one example, after feature extraction is performed on the target clothes image with five convolutional layers of kernel size 3 and stride 1, the feature maps are sampled with five sampling networks of the same size as the feature maps extracted by the convolutional layers, thereby realizing feature warping. Then the five feature-warped outputs are input into the deconvolution layers corresponding to the five convolutional layers and decoded to generate the feature-warped clothes image and the predicted warped-clothes composite mask (of the 4-channel output of the decoder, the first 3 channels output the feature-warped clothes image and the 4th channel outputs the predicted warped-clothes composite mask). Finally, the warped-clothes composite mask is used to perform an element-wise multiplication between the pixel-warped clothes image and the feature-warped clothes image to obtain the warped clothes image C_W. In the subsequent step, the warped clothes image and the clothes-independent fusion map are combined to predict the semantic segmentation map of the target clothes in the image of the reference person wearing the target clothes.
In this embodiment, a combined strategy of pixel warping and feature warping is used in the clothes warping module, which generates a more natural and realistic warped clothes shape and texture, improves robustness to deformation, rotation and occlusion, and solves the shape and texture distortion problem that occurs when pixel warping is used alone.
Then the clothes-independent fusion map generation module fuses the clothes region and the arm region in the human body part semantic segmentation map to obtain the clothes-independent fusion map, and the semantic segmentation map generation module predicts the semantic segmentation map of the target clothes in the image of the reference person wearing the target clothes from the clothes-independent fusion map and the warped clothes image. Specifically, the clothes-independent fusion map generation module fuses the clothes region and the arm region in the human body part semantic segmentation map into one indivisible region to obtain the clothes-independent fusion map; the semantic segmentation map generation module then concatenates the clothes-independent fusion map and the warped clothes image as the input of a semantic segmentation network to predict the semantic segmentation map. Here, the semantic segmentation map is the semantic segmentation map of the reference person wearing the target clothes; it helps the subsequent arm generation step to synthesize a new arm image and helps synthesize the pixels of each body part during try-on synthesis. In practical applications, a standard U-Net network may be used to predict the semantic segmentation map, or any other network structure capable of achieving this purpose may be used, which is not specifically limited here. When a standard U-Net network is used, the two inputs are the warped clothes image and the clothes-independent fusion map respectively, and the output is the predicted semantic segmentation map. It should be clear that the order of the steps of generating the clothes-independent fusion map and warping the target clothes image can be adjusted according to the actual situation, and is not specifically limited here.
After the semantic segmentation graph is generated, the step of generating an arm image according to the predicted semantic segmentation graph and the gesture graph is immediately carried out, wherein the arm image generation module comprises: the arm mask extracting unit is used for extracting an arm mask from the semantic segmentation image; a palm image extracting unit that extracts a palm image from the reference person image based on a palm key point in the posture diagram; and the arm image synthesis unit is used for generating an arm image by using the palm image and the arm mask. Specifically, the palm image is extracted from an original reference person image according to a palm joint point in the gesture graph, the arm mask is extracted from the output of the semantic segmentation network, and the two images are input into the arm generation network to generate a new arm image. Similarly, in practical applications, a standard U-Net network may be used to predict the arm image, and any other network structure capable of achieving this purpose may also be used, which is not specifically limited herein. When a standard U-Net network is used, the two inputs are a palm picture and an arm mask picture respectively, and the output is a generated arm image.
After the arm image is generated, the try-on synthesis module generates the try-on composite image from the warped clothes image, the predicted semantic segmentation map, the arm image and the detail map of other non-try-on regions. The try-on synthesis module includes: a preliminary synthesis unit, configured to obtain a preliminary try-on composite image and a predicted try-on image synthesis mask from the semantic segmentation map, the extracted non-try-on-region detail map, the warped clothes image and the arm image; and a secondary synthesis unit, configured to use the try-on image synthesis mask to perform an element-wise multiplication between the preliminary try-on composite image and the warped clothes image, obtaining the final try-on composite image. Specifically, the semantic segmentation map M_P, the extracted non-try-on-region details, the warped clothes image C_W and the arm image I_arm are passed into the try-on synthesis network W7 to obtain the preliminary try-on composite image I_P and the predicted try-on image synthesis mask M_com; then the try-on image synthesis mask M_com is used to perform an element-wise multiplication between the preliminary try-on composite image I_P and the warped clothes image C_W, obtaining the final try-on composite image I_f. Similarly, in practical applications, a standard U-Net network may be used here, or any other network structure capable of achieving this purpose, such as other U-Net variants, which is not specifically limited here. When a standard U-Net network is used, the four inputs are the semantic segmentation map M_P, the extracted non-try-on-region details, the warped clothes image C_W and the arm image I_arm, and the outputs are the preliminary try-on composite image I_P and the predicted try-on image synthesis mask M_com.
In one example, a virtual fitting network is built from the network structures described in the above examples and compared against the network structure of the conventional CP-VTON method. The data set comprises a training set of 14221 samples and a test set of 2032 samples; the training set is used to train the CP-VTON model, and the test set is then used to obtain 2032 try-on composite images. The model of the present invention is processed in the same way to obtain 2032 test-set results, and the virtual fitting effect is evaluated with four indexes: SSIM (Structural Similarity), IS (Inception Score), FID (Fréchet Inception Distance) and PSNR (peak signal-to-noise ratio), where SSIM and FID measure the difference between the generated try-on composite images and the original test-set images, and IS and PSNR measure the quality of the generated images. Table 1 shows the evaluation results of the virtual fitting method provided by the present invention (Our method in the table) and of the conventional CP-VTON method. It can be seen from the table that, compared with the conventional CP-VTON method, the virtual fitting method of the present invention achieves a better fitting effect.
Table 1: comparative graph of evaluation results
Method SSIM IS FID PSNR
CP-VTON 0.745 2.757 19.108 21.111
Our method 0.867 3.016 9.158 24.505
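For reference, two of the four metrics can be computed per image pair roughly as sketched below (Python with scikit-image >= 0.19; function names are from that library). FID and IS additionally require features from a pretrained Inception network and are omitted here; the exact evaluation protocol of the patent is not specified in this sketch.

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def evaluate_pair(generated: np.ndarray, reference: np.ndarray):
    """Compute SSIM and PSNR between a generated try-on image and the
    corresponding test-set image (uint8 HxWx3 arrays)."""
    ssim = structural_similarity(reference, generated, channel_axis=2)
    psnr = peak_signal_noise_ratio(reference, generated)
    return ssim, psnr
```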
In addition, the try-on effect is qualitatively compared from three aspects: the completeness of the generated clothes, the completeness of the generated arms, and the clothes fit / overall visual quality of the try-on composite image. Fig. 8 compares try-on effects based on the completeness of the generated clothes, fig. 9 based on the completeness of the generated arms, and fig. 10 based on the clothes fit / overall visual quality of the try-on composite image, where in each figure (a) is the reference person image, (b) is the target clothes image, (c) is the try-on result of the CP-VTON method and (d) is the try-on result of the method of the present invention. As can be seen from the figures, compared with the traditional CP-VTON method, the method of the present invention achieves better results in all three aspects.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of program modules is illustrated, and in practical applications, the above-described distribution of functions may be performed by different program modules, that is, the internal structure of the apparatus may be divided into different program units or modules to perform all or part of the above-described functions. Each program module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one processing unit, and the integrated unit may be implemented in a form of hardware, or may be implemented in a form of software program unit. In addition, the specific names of the program modules are only used for distinguishing one program module from another, and are not used for limiting the protection scope of the present invention.
Fig. 11 is a schematic structural diagram of a terminal device provided in an embodiment of the present invention. As shown, the terminal device 200 includes: a processor 220, a memory 210, and a computer program 211 stored in the memory 210 and operable on the processor 220, such as a virtual fitting program. The processor 220 implements the steps in each of the above embodiments of the virtual fitting method when executing the computer program 211, or implements the functions of the modules in each of the above embodiments of the virtual fitting apparatus when executing the computer program 211.
The terminal device 200 may be a notebook, a palm pc, a tablet computer, a mobile phone, or the like. Terminal device 200 may include, but is not limited to, processor 220, memory 210. Those skilled in the art will appreciate that fig. 11 is merely an example of terminal device 200, does not constitute a limitation of terminal device 200, and may include more or fewer components than shown, or some components in combination, or different components, such as: terminal device 200 may also include input-output devices, display devices, network access devices, buses, and the like.
The Processor 220 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. The general purpose processor 220 may be a microprocessor or the processor may be any conventional processor or the like.
The memory 210 may be an internal storage unit of the terminal device 200, such as: a hard disk or a memory of the terminal device 200. The memory 210 may also be an external storage device of the terminal device 200, such as: a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the terminal device 200. Further, the memory 210 may also include both an internal storage unit of the terminal device 200 and an external storage device. The memory 210 is used to store the computer program 211 and other programs and data required by the terminal device 200. The memory 210 may also be used to temporarily store data that has been output or is to be output.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or recited in detail in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed terminal device and method may be implemented in other ways. For example, the above-described terminal device embodiments are merely illustrative, and for example, a module or a unit may be divided into only one logical function, and may be implemented in other ways, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
If the integrated modules/units are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the methods of the above embodiments may also be implemented by the computer program 211 instructing the relevant hardware; the computer program 211 may be stored in a computer-readable storage medium, and when executed by the processor 220 it implements the steps of the above method embodiments. The computer program 211 comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable storage medium may include: any entity or device capable of carrying the code of the computer program 211, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable storage medium may be increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals.
It should be noted that the above embodiments can be freely combined as needed. The foregoing is only a preferred embodiment of the present invention; those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and such modifications and improvements should also be regarded as falling within the protection scope of the present invention.

Claims (8)

1. A four-stage virtual fitting method based on 2D images, characterized by comprising the following steps:
acquiring a reference person image and a target clothes image;
extracting a human body part semantic segmentation map and a pose map from the reference person image, and fusing the clothes region and the arm region in the human body part semantic segmentation map to obtain a clothes-independent fusion map;
warping the target clothes image to obtain a warped clothes image;
predicting, according to the clothes-independent fusion map and the warped clothes image, a semantic segmentation map of the image of the reference person wearing the target clothes;
generating an arm image according to the predicted semantic segmentation map and the pose map;
generating a fitting composite image according to the warped clothes image, the predicted semantic segmentation map and the arm image, thereby completing the four-stage virtual fitting based on 2D images;
wherein, after the human body part semantic segmentation map and the pose map are extracted from the reference person image, the method further comprises:
obtaining a rough body shape map and a non-fitting-area detail map according to the human body part semantic segmentation map and the reference person image, wherein the non-fitting-area detail map is an image of the non-fitting area excluding the arm region;
synthesizing a reference person representation according to the rough body shape map, the pose map and the non-fitting-area detail map;
and the warping the target clothes image to obtain a warped clothes image comprises:
respectively passing the reference person representation and the target clothes image through two twin convolutional neural networks with non-shared parameters to extract features, wherein the two twin convolutional neural networks have the same structure;
passing the features of the reference person representation and the target clothes image into a regression network to predict spatial transformation parameters;
and warping the pixels of the target clothes image according to the spatial transformation parameters to obtain a pixel-warped clothes image, thereby completing the warping of the target clothes image.
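The warping stage recited in claim 1 resembles a learned geometric matching module. The Python code below is an illustrative sketch only and is not part of the patent text: it assumes two structurally identical, non-parameter-sharing encoders, a regression head predicting the spatial transformation parameters (simplified here to an affine 2x3 matrix, since the claim does not fix the parameterization), and a differentiable warp of the clothes pixels; all class names, function names, and channel counts are assumptions.

# Illustrative sketch only -- not part of the patent text. A minimal PyTorch layout of the
# clothes-warping stage of claim 1: twin encoders with the same structure but non-shared
# parameters, a regression network predicting spatial transformation parameters, and a
# differentiable warp of the clothes pixels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    # Small convolutional feature extractor; the two branches reuse this structure
    # but are instantiated separately, so their parameters are not shared.
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.net(x)

class ClothesWarper(nn.Module):
    def __init__(self, person_ch=22, cloth_ch=3):
        super().__init__()
        self.person_enc = Encoder(person_ch)   # encodes the reference person representation
        self.cloth_enc = Encoder(cloth_ch)     # encodes the target clothes image
        # Regression network: pools the concatenated features and predicts the
        # spatial transformation parameters (6 values for an affine transform).
        self.regress = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(512, 6)
        )
    def forward(self, person_repr, cloth):
        feats = torch.cat([self.person_enc(person_repr), self.cloth_enc(cloth)], dim=1)
        theta = self.regress(feats).view(-1, 2, 3)
        # Warp the clothes pixels with the predicted transformation parameters.
        grid = F.affine_grid(theta, cloth.size(), align_corners=False)
        return F.grid_sample(cloth, grid, align_corners=False), theta

In published 2D try-on pipelines a thin-plate-spline grid is usually regressed instead of a single affine matrix, which better captures non-rigid garment deformation; the twin-encoder and regression structure is otherwise the same.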
2. The four-stage virtual fitting method of claim 1, wherein the generating an arm image according to the predicted semantic segmentation map and the pose map comprises:
extracting an arm mask from the predicted semantic segmentation map;
extracting a palm image from the reference person image according to the palm key points in the pose map;
and generating an arm image using the palm image and the arm mask.
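The arm-generation step in claim 2 keeps the real palm from the reference person image while the rest of the arm is synthesized. The snippet below is an illustrative sketch only, not part of the patent: the palm is cropped around the palm key points taken from the pose map and pasted back over a generated arm restricted to the arm mask; the crop radius and function names are assumptions, and the learned arm generator itself is not shown.

# Illustrative sketch only -- not part of the patent text. Palm preservation around the
# palm key points from the pose map, combined with a generated arm limited to the arm mask.
import numpy as np

def crop_palm(person_img, palm_keypoints, radius=32):
    # Copy square patches around each palm key point from the reference person image.
    h, w = person_img.shape[:2]
    palm_img = np.zeros_like(person_img)
    palm_mask = np.zeros((h, w), dtype=bool)
    for x, y in palm_keypoints:
        x0, x1 = max(0, int(x) - radius), min(w, int(x) + radius)
        y0, y1 = max(0, int(y) - radius), min(h, int(y) + radius)
        palm_img[y0:y1, x0:x1] = person_img[y0:y1, x0:x1]
        palm_mask[y0:y1, x0:x1] = True
    return palm_img, palm_mask

def compose_arm(generated_arm, arm_mask, palm_img, palm_mask):
    # Keep the synthesized arm only inside the arm mask, then paste the real palm pixels back.
    arm = generated_arm * arm_mask[..., None]
    return np.where(palm_mask[..., None], palm_img, arm)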
3. The four-stage virtual fitting method according to claim 1, wherein
after the warping the pixels of the target clothes image according to the spatial transformation parameters to obtain a pixel-warped clothes image, the method further comprises:
passing the target clothes image into a convolutional neural network to extract features;
warping the extracted features of the target clothes image according to the spatial transformation parameters;
passing the warped features into a deconvolutional neural network corresponding in structure to the convolutional neural network, so as to obtain a feature-warped clothes image and a predicted warped-clothes composite mask;
synthesizing the pixel-warped clothes image and the feature-warped clothes image based on the warped-clothes composite mask to obtain the warped clothes image; and/or,
the generating a fitting composite image according to the warped clothes image, the predicted semantic segmentation map and the arm image comprises:
obtaining a preliminary fitting composite image and a predicted fitting-image composite mask according to the semantic segmentation map, the extracted non-fitting-area detail map, the warped clothes image and the arm image;
and performing an element-by-element multiplication operation on the preliminary fitting composite image and the warped clothes image using the fitting-image composite mask to obtain the final fitting composite image.
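Both mask-based synthesis steps in claim 3 can be read as the same element-wise blend: multiply one image by the composite mask, the other by its complement, and add. The short sketch below is illustrative only and not part of the patent text; it assumes the predicted mask lies in [0, 1] and that the usual convex combination with the mask's complement is intended.

# Illustrative sketch only -- not part of the patent text. The element-wise blend used both
# for merging the pixel-warped and feature-warped clothes images and for merging the warped
# clothes into the preliminary fitting composite image.
import torch

def masked_blend(foreground, background, mask):
    # Take `foreground` where the mask is 1 and `background` where it is 0.
    return mask * foreground + (1.0 - mask) * background

# warped_cloth   = masked_blend(feature_warped_cloth, pixel_warped_cloth, cloth_composite_mask)
# final_fitting  = masked_blend(warped_cloth, preliminary_fitting, fitting_composite_mask)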
4. A four-stage virtual fitting device based on 2D images, characterized by comprising:
an image acquisition module, configured to acquire a reference person image and a target clothes image;
a clothes-independent fusion map generation module, configured to extract a human body part semantic segmentation map and a pose map from the reference person image, and to fuse the clothes region and the arm region in the human body part semantic segmentation map to obtain a clothes-independent fusion map;
a clothes warping module, configured to warp the target clothes image to obtain a warped clothes image;
a semantic segmentation map generation module, configured to predict, according to the clothes-independent fusion map and the warped clothes image, a semantic segmentation map of the image of the reference person wearing the target clothes;
an arm image generation module, configured to generate an arm image according to the predicted semantic segmentation map and the pose map;
a fitting synthesis module, configured to generate a fitting composite image according to the warped clothes image, the predicted semantic segmentation map and the arm image, thereby completing the four-stage virtual fitting based on 2D images;
wherein the clothes-independent fusion map generation module further comprises:
an image extraction unit, configured to obtain a rough body shape map and a non-fitting-area detail map according to the human body part semantic segmentation map and the reference person image, wherein the non-fitting-area detail map is an image of the non-fitting area excluding the arm region;
a reference person representation generation unit, configured to synthesize a reference person representation according to the rough body shape map, the pose map and the non-fitting-area detail map;
and the clothes warping module comprises:
a first feature extraction unit, configured to respectively pass the reference person representation and the target clothes image through two twin convolutional neural networks with non-shared parameters to extract features, wherein the two twin convolutional neural networks have the same structure;
a spatial transformation parameter prediction unit, configured to pass the features of the reference person representation and the target clothes image into a regression network to predict spatial transformation parameters;
and a pixel warping unit, configured to warp the pixels of the target clothes image according to the spatial transformation parameters to obtain a pixel-warped clothes image, thereby completing the warping of the target clothes image.
5. The four-stage virtual fitting device according to claim 4, wherein the arm image generation module comprises:
an arm mask extraction unit, configured to extract an arm mask from the predicted semantic segmentation map;
a palm image extraction unit, configured to extract a palm image from the reference person image according to the palm key points in the pose map;
and an arm image synthesis unit, configured to generate an arm image using the palm image and the arm mask.
6. The four-stage virtual fitting device according to claim 4, wherein the clothes warping module further comprises:
a second feature extraction unit, configured to pass the target clothes image into a convolutional neural network to extract features;
a feature warping unit, configured to warp the extracted features of the target clothes image according to the spatial transformation parameters;
a feature-warped clothes image generation unit, configured to pass the warped features into a deconvolutional neural network corresponding in structure to the convolutional neural network, so as to obtain a feature-warped clothes image and a predicted warped-clothes composite mask;
a warped-clothes synthesis unit, configured to synthesize the pixel-warped clothes image and the feature-warped clothes image based on the warped-clothes composite mask to obtain the warped clothes image; and/or
the fitting synthesis module comprises:
a preliminary synthesis unit, configured to obtain a preliminary fitting composite image and a predicted fitting-image composite mask according to the semantic segmentation map, the extracted non-fitting-area detail map, the warped clothes image and the arm image;
and a secondary synthesis unit, configured to perform an element-by-element multiplication operation on the preliminary fitting composite image and the warped clothes image using the fitting-image composite mask to obtain the final fitting composite image.
7. A terminal device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the four-stage virtual fitting method based on 2D images according to any one of claims 1 to 3 when executing the computer program.
8. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the four-stage virtual fitting method based on 2D images according to any one of claims 1 to 3.
CN202011116951.7A 2020-10-19 2020-10-19 Four-stage virtual fitting method and device based on 2D image Active CN112232914B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011116951.7A CN112232914B (en) 2020-10-19 2020-10-19 Four-stage virtual fitting method and device based on 2D image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011116951.7A CN112232914B (en) 2020-10-19 2020-10-19 Four-stage virtual fitting method and device based on 2D image

Publications (2)

Publication Number Publication Date
CN112232914A CN112232914A (en) 2021-01-15
CN112232914B true CN112232914B (en) 2023-04-18

Family

ID=74117465

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011116951.7A Active CN112232914B (en) 2020-10-19 2020-10-19 Four-stage virtual fitting method and device based on 2D image

Country Status (1)

Country Link
CN (1) CN112232914B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112991494B (en) * 2021-01-28 2023-09-15 腾讯科技(深圳)有限公司 Image generation method, device, computer equipment and computer readable storage medium
KR102374141B1 (en) * 2021-03-03 2022-03-15 (주)내스타일 Costume region removal method for flexible virtual fitting image generation
CN113012303B (en) * 2021-03-10 2022-04-08 浙江大学 Multi-variable-scale virtual fitting method capable of keeping clothes texture characteristics
US11961266B2 (en) * 2021-03-31 2024-04-16 Sony Group Corporation Multiview neural human prediction using implicit differentiable renderer for facial expression, body pose shape and clothes performance capture
CN113822175B (en) * 2021-09-03 2023-09-01 西安工程大学 Virtual fitting image generation method based on key point clustering driving matching
CN114638665A (en) * 2022-03-11 2022-06-17 北京奇艺世纪科技有限公司 Image processing method, device, equipment and medium
CN114663552B (en) * 2022-05-25 2022-08-16 武汉纺织大学 Virtual fitting method based on 2D image

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107798713A (en) * 2017-09-04 2018-03-13 昆明理工大学 A kind of image distortion method tried on towards two-dimensional virtual
CN108510594A (en) * 2018-02-27 2018-09-07 吉林省行氏动漫科技有限公司 Virtual fit method, device and terminal device
CN110148040A (en) * 2019-05-22 2019-08-20 珠海随变科技有限公司 A kind of virtual fit method, device, equipment and storage medium
CN111062777A (en) * 2019-12-10 2020-04-24 中山大学 Virtual fitting method and system capable of reserving example clothes details
CN111768472A (en) * 2020-05-29 2020-10-13 北京沃东天骏信息技术有限公司 Virtual fitting method and device and computer-readable storage medium
CN111787242A (en) * 2019-07-17 2020-10-16 北京京东尚科信息技术有限公司 Method and apparatus for virtual fitting

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120136755A1 (en) * 2010-11-29 2012-05-31 Yang Jin Seok System and Method for Providing Virtual Fitting Experience

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107798713A (en) * 2017-09-04 2018-03-13 昆明理工大学 A kind of image distortion method tried on towards two-dimensional virtual
CN108510594A (en) * 2018-02-27 2018-09-07 吉林省行氏动漫科技有限公司 Virtual fit method, device and terminal device
CN110148040A (en) * 2019-05-22 2019-08-20 珠海随变科技有限公司 A kind of virtual fit method, device, equipment and storage medium
CN111787242A (en) * 2019-07-17 2020-10-16 北京京东尚科信息技术有限公司 Method and apparatus for virtual fitting
CN111062777A (en) * 2019-12-10 2020-04-24 中山大学 Virtual fitting method and system capable of reserving example clothes details
CN111768472A (en) * 2020-05-29 2020-10-13 北京沃东天骏信息技术有限公司 Virtual fitting method and device and computer-readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design and Implementation of a Computer Virtual Fitting Measurement System Based on Unity3D; Zhang Yu et al.; Computer Measurement & Control; 2018-12-31; Vol. 26, No. 2; pp. 44-47 *

Also Published As

Publication number Publication date
CN112232914A (en) 2021-01-15

Similar Documents

Publication Publication Date Title
CN112232914B (en) Four-stage virtual fitting method and device based on 2D image
CN112258269B (en) Virtual fitting method and device based on 2D image
Jiang et al. Unified no-reference quality assessment of singly and multiply distorted stereoscopic images
CN109255831B (en) Single-view face three-dimensional reconstruction and texture generation method based on multi-task learning
CN110399849B (en) Image processing method and device, processor, electronic device and storage medium
JP7446457B2 (en) Image optimization method and device, computer storage medium, computer program, and electronic equipment
CN111787242A (en) Method and apparatus for virtual fitting
Kang et al. Competitive learning of facial fitting and synthesis using uv energy
CN113724354A (en) Reference image color style-based gray level image coloring method
Lu et al. Improved image classification with 4D light-field and interleaved convolutional neural network
Yu et al. Long-term photometric consistent novel view synthesis with diffusion models
CN117593178A (en) Virtual fitting method based on feature guidance
CN113570725A (en) Three-dimensional surface reconstruction method and device based on clustering, server and storage medium
CN116362972B (en) Image processing method, device, electronic equipment and storage medium
Zhang et al. Consecutive context perceive generative adversarial networks for serial sections inpainting
CN115880748A (en) Face reconstruction and occlusion region identification method, device, equipment and storage medium
Zhang et al. MFFNet: Single facial depth map refinement using multi-level feature fusion
CN116958306A (en) Image synthesis method and device, storage medium and electronic equipment
CN115689947A (en) Image sharpening method, system, electronic device and storage medium
CN115393241A (en) Medical image enhancement method and device, electronic equipment and readable storage medium
Tao et al. LEGAN: A low-light image enhancement generative adversarial network for industrial internet of smart-cameras
Lin et al. FAEC‐GAN: An unsupervised face‐to‐anime translation based on edge enhancement and coordinate attention
Sun et al. Image inpainting with learnable edge-attention maps
CN115147681B (en) Training of clothing generation model and method and device for generating clothing image
CN115147526B (en) Training of clothing generation model and method and device for generating clothing image

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant