CN116777738A - Authenticity virtual fitting method based on clothing region alignment and style retention modulation - Google Patents

Authenticity virtual fitting method based on clothing region alignment and style retention modulation

Info

Publication number
CN116777738A
CN116777738A (application CN202310901979.9A)
Authority
CN
China
Prior art keywords
image
clothing
human body
modulation
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310901979.9A
Other languages
Chinese (zh)
Inventor
陈金广
张馨
马丽丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Polytechnic University
Original Assignee
Xian Polytechnic University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Polytechnic University filed Critical Xian Polytechnic University
Priority to CN202310901979.9A priority Critical patent/CN116777738A/en
Publication of CN116777738A publication Critical patent/CN116777738A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30196 Human being; Person

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an authenticity virtual fitting method based on clothing region alignment and style retention modulation, which comprises the following steps: step one, a human body image I and a clothing image C are preprocessed to obtain a human body segmentation map S of the person in image I, a mask M of the clothing image C, a pose map P of the person in image I and a dense pose P_d, and a clothing-agnostic human body representation is constructed to obtain a preprocessing result; step two, a virtual try-on model based on clothing region alignment and style retention modulation is constructed and its loss functions are designed; step three, the model is trained with a paired image training set and the preprocessing results obtained from it through step one, and the model weights are continuously optimized through back propagation; step four, the human body image of a wearer and the image of the garment to be tried on are processed through step one, and the obtained preprocessing result is input into the trained virtual try-on model based on clothing region alignment and style retention modulation. The method solves the problems of blurred features and poor realism in the try-on images generated by existing methods.

Description

Authenticity virtual fitting method based on clothing region alignment and style retention modulation
Technical Field
The invention relates to the technical field of virtual fitting, in particular to an authenticity virtual fitting method based on clothing region alignment and style retention modulation.
Background
Virtual fitting technology allows a user to preview the effect of changing clothes without physically touching or putting on the garment. This technology is expected to alleviate, to a certain extent, the try-on problem in online clothing retail, thereby improving the shopping experience of consumers, lowering return rates, reducing sales costs and promoting the development of the clothing e-commerce industry. By simulating the on-body effect of a garment, it also allows garment designers to adjust garment shapes quickly and conveniently, bringing benefits to the fashion field.
Most traditional virtual try-on methods are based on three-dimensional model construction: they first build a three-dimensional human body model, then model the garment and its cloth, and render the garment onto the three-dimensional body to show the dressed effect. Although such methods can achieve relatively accurate physical simulation of the try-on effect, they need to acquire three-dimensional human body information with special equipment such as depth cameras and three-dimensional scanners, or through multi-view human body image reconstruction. They suffer from heavy computation, complex models, long modeling time and the risk of leaking personal privacy, which limits their popularization and application.
With the development of deep learning, virtual try-on methods based on image generation have been proposed. They use information such as human pose estimation and human parsing to guide garment deformation and generate a try-on result image, realizing the virtual effect of transferring a target garment onto the person in the image. These methods only require a human body image and a clothing image captured by an ordinary camera as input, do not need special equipment to acquire three-dimensional human body information, and generate try-on images quickly, so they suit existing consumer scenarios such as online clothing shopping and have attracted wide attention. Classical works include CP-VTON (Wang B, Zheng H, Liang X, et al. Toward characteristic-preserving image-based virtual try-on network [C]// Proceedings of the European Conference on Computer Vision (ECCV). 2018: 589-604.), ACGPN (Yang H, Zhang R, Guo X, et al. Towards photo-realistic virtual try-on by adaptively generating-preserving image content [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020: 7850-7859.) and HR-VITON (Lee S, Gu G, Park S, et al. High-resolution virtual try-on with misalignment and occlusion-handled conditions [C]// European Conference on Computer Vision (ECCV). Cham: Springer Nature Switzerland, 2022: 204-219.). CP-VTON deforms the garment by learning the parameters of a TPS (Thin-Plate Spline) transformation through a geometric matching module, alleviating the difficulty of preserving garment characteristics. ACGPN decides whether image content should be generated or preserved according to a generated semantic layout and introduces a second-order difference constraint on the TPS transformation, making the garment deformation more reasonable and improving the quality of the virtual try-on result. HR-VITON unifies the garment deformation and semantic layout generation processes, achieving a significant breakthrough and higher-quality try-on results.
However, there is a misaligned region between the human body segmentation map generated by existing methods and the deformed garment, and the try-on results still struggle to present clear human body features and lack realism. The present invention improves upon existing methods to address these shortcomings.
Disclosure of Invention
The invention aims to provide an authenticity virtual try-on method based on clothing region alignment and style retention modulation, which solves the problems of blurred features and poor realism in try-on images generated by existing methods.
The technical solution adopted by the invention is an authenticity virtual fitting method based on clothing region alignment and style retention modulation, implemented according to the following steps:
step one, preprocessing a human body image I and a clothing image C to obtain a human body segmentation map S of the person in image I, a mask M of the clothing image C, a pose map P of the person in image I and a dense pose P_d, and constructing a clothing-agnostic human body representation to obtain a preprocessing result;
step two, constructing a virtual try-on model based on clothing region alignment and style retention modulation, and designing its loss functions;
step three, training the virtual try-on model based on clothing region alignment and style retention modulation with a paired image training set and the preprocessing results obtained from it through step one, continuously optimizing the model weights through back propagation; after training is completed, storing the finally obtained weights;
step four, processing the human body image of the wearer and the image of the garment to be tried on through step one, inputting the obtained preprocessing result into the virtual try-on model based on clothing region alignment and style retention modulation, and obtaining a realistic try-on image from the trained weights, thereby realizing the virtual try-on function.
The present invention is also characterized in that,
the first step is as follows:
For the human body image I and the clothing image C, a human body segmentation map S of the person in image I, a mask M of the clothing image C and a pose map P of the person in image I are obtained using the human parser PGN (Instance-level human parsing via part grouping network) and the pose estimator OpenPose (Realtime multi-person 2D pose estimation using part affinity fields). The clothing region in the human body segmentation map S and in the human body image I, together with the arm region that carries sleeve-length information, are removed, and the hand regions that are difficult to reconstruct are preserved with the help of the pose map P, yielding the clothing-agnostic human body segmentation map S_a and the clothing-agnostic human body image I_a. The dense pose P_d of the person in image I is obtained using the dense pose estimator DensePose (Dense human pose estimation in the wild). Compared with the pose map P, the dense pose P_d contains more accurate human body pose information and better guides the human image generation process.
In the second step, the virtual try-on model based on clothing region alignment and style retention modulation consists of a condition generator and a try-on generator. The condition generator stage takes two groups of information as input: one group is the clothing image C and its mask M, and the other group is the clothing-agnostic human body segmentation map S_a and the dense pose P_d. It outputs the deformed garment C_w matching the human body pose, its mask M_w and the generated target semantic segmentation map Ŝ. The try-on generator stage takes as input the clothing-agnostic human body image I_a, the deformed garment C_w matching the human body pose, the dense pose P_d and the generated target semantic segmentation map Ŝ, and under this multi-level guidance generates the final try-on result image Î as output.
In the second step, the condition generator comprises two encoders E_1 and E_2, a decoder composed of multi-level feature fusion modules, and the condition alignment processing module. The condition generator stage proceeds as follows: first, the two encoders E_1 and E_2 extract multi-level features from the two groups of inputs, and the output features of each encoding layer form feature pyramids; the final-level features obtained by E_1 and E_2 are concatenated along the channel dimension, passed through a convolution layer and upsampled to obtain the first-level appearance flow; the final-level segmentation-branch feature is upsampled through a residual block to obtain the initial segmentation-map feature, and the final-level clothing-branch feature serves as the initial clothing feature. The first-level appearance flow, the initial segmentation-map feature and the initial clothing feature are input into the decoder composed of multi-level feature fusion modules, which exchange information between the appearance flow and the segmentation-map features and progressively refine both over the multi-level decoding process; the decoder outputs the final appearance flow and the final segmentation-map feature, which is taken as the initial target segmentation map. Using the final appearance flow, the initially deformed garment C_w,raw and its corresponding mask M_w,raw are obtained. The initial target segmentation map, the initially deformed garment C_w,raw and its mask M_w,raw are then input into the condition alignment processing module, whose alignment processing includes removing misaligned regions from the initial target segmentation map and removing from the initially deformed garment C_w,raw the body regions that will be covered by arms, hair and the like over the garment in the final try-on image. This yields the deformed garment C_w matching the human body pose, its mask M_w and the generated target semantic segmentation map Ŝ.
In the second step, the loss function used in the condition generator stage isThe specific expression is as follows:
wherein ,the loss of L1, the loss of perception, the loss of regularization of the appearance flow, the loss of least square countermeasure, the loss of standard pixel cross entropy and the loss of alignment of the clothing area are respectively; lambda (lambda) L1 、λ VGG 、λ SM 、λ cGAN 、λ CE and λAL Respectively indicate-> Corresponding super parameters; respectively set as lambda L1 =λ CE =λ AL =10,λ SM =2,λ VGG =λ cGAN =1;
Wherein L1 is lostPerception loss->Is defined as follows:
in the formula ,wi Determining the relative importance between each item of apparel,represents an i-th level of apparent flow, S c Is a segmentation map of the clothing region in the human body image I, I c Is the clothing area in the human body image I, phi calculates the difference between the feature maps obtained by two inputs through VGG network, and +.>Representing based on the i-th level of apparent flow->Performing a deforming operation on the mask M of the clothing image C, similarly, < >>Representing based on the i-th level of apparent flow->Performing deformation operation on the clothing image C;
appearance flow regularization lossLeast squares contrast loss function>Is defined as follows:
in the formula ,representing the 4 th, i.e. last, level of appearance stream, D representing the arbiter and X representing the input of the generator;
standard pixel cross entropy lossIs defined as follows:
in the formula ,Hs 、W s and Cs The height, width and number of channels of the human body segmentation map S are represented. S is S k,y,x Andrepresenting a human segmentation map S corresponding to coordinates (x, y) in channel k and a generated target semantic segmentation map +.>Pixel values of (2);
for the loss of alignment of the clothing region, the loss is defined as the deformed clothing C matching the human body posture w Mask M of (2) w And the generated target semantic segmentation map +.>Middle-wear channel->The L1 norm of the difference is defined as follows:
In the second step, the try-on generator consists of multiple groups of residual block plus upsampling layer structures. The try-on generator stage proceeds as follows: the clothing-agnostic human body image I_a, the deformed garment C_w matching the human body pose and the dense pose P_d serve as the inputs of the first group of residual blocks, and these inputs are collectively referred to as the try-on components. Meanwhile, the try-on components are resized to match the feature map output by each residual block plus upsampling structure and concatenated with it, and the concatenation result serves as the input feature map of the next residual block, so that the features are refined at multiple scales. Modulation parameters generated from the target semantic segmentation map Ŝ modulate the feature map entering each residual block, guiding the try-on image generation process. Within a residual block, the input feature map (the input of the first group of residual blocks, or the concatenation produced by a residual block plus upsampling) is processed along two paths: the main path passes sequentially through two network layers each consisting of SPM modulation (i.e., style-preserving modulation), a ReLU activation layer and a convolution layer, while the branch passes through one identical network layer and is then added to the main path to obtain the output features.
In step two, the SPM modulation of the feature map entering a residual block with the modulation parameters generated from the target semantic segmentation map Ŝ is divided into two steps: the first step aims to integrate the context style and the semantic layout, and the second step aims to inject the fused information into the feature map.
In the first modulation step, two kinds of parameters are generated: four semantic modulation parameters and two context modulation parameters. The semantic modulation parameters comprise two groups of scale and shift parameters. The context modulation parameters (γ_c, β_c) are generated from the original feature map without passing through a normalization layer, so that the un-normalized feature map retains more of the context style. For the i-th layer feature map F_i, the first modulation step fuses these parameters to produce the fused modulation parameters γ_f and β_f.
In the second modulation step, the normalized feature map is modulated with the fused modulation parameters to obtain the modulated features F̂_i = γ_f ⊙ Norm(F_i) + β_f, where Norm(·) denotes the normalization layer and ⊙ denotes element-wise multiplication.
In the second step, the loss function of the test generator stage is designed asThe specific expression is as follows:
wherein ,the method comprises the steps of fitting image perception loss, countermeasure loss and characteristic matching loss; and />Representation->Corresponding super parameters; are respectively set as->
Try-on image perceived lossIn the generated fitting results->The similarity between the human body image I and the characteristic layer is restrained, and the image perception loss is tried on>The expression of (2) is:
wherein V is the number of layers used in the VGG network, V (i) and Ri The feature map and the element number are obtained from the ith layer in the VGG network respectively;
countering lossesLoss of matching features->For generating data which fit reality on multiple scales and stabilizing the training process against loss +.>Loss of matching features->The expression of (2) is as follows:
wherein ,DI For discriminator, T I Is discriminator D I The number of layers in the layer is equal, and Ki Respectively discriminator D I The activation function and the number of elements of the i-th layer.
Step three, the paired image training set consists of multiple person images and the corresponding clothing images worn by those persons, and during training these image pairs and the preprocessing results obtained through step one are used; the clothing-agnostic human body representation obtained by preprocessing prevents the generalization ability of the model from being impaired.
During training, an Adam optimizer is used to optimize the parameters in the back-propagation process; the condition generator is trained for 150,000 iterations and the try-on generator for 100,000 iterations.
The beneficial effects of the invention are as follows:
the method provided by the invention provides a new virtual try-on model based on the high-resolution image, and can generate a try-on result with reality. Specifically, the method of the invention realizes the alignment of the generated segmentation graph and the clothing shape through the arrangement of the clothing region alignment loss, and the alignment is combined with the try-on generator kept by the context wind grid, so that the method can generate the result with more sense of reality details, the human body features are clear and visible, and the skin and clothing boundary is clear. The effectiveness of the present invention was demonstrated by evaluation in extensive experiments conducted on the VITON-HD dataset.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a process of the pretreatment operation of the present invention;
FIG. 3 is a diagram of the virtual try-on model based on clothing region alignment and style retention modulation constructed in the invention;
FIG. 4 is a diagram of the style-preserving residual block structure in the try-on generator of the method of the present invention;
FIG. 5 is a diagram of the style-preserving modulation (SPM) structure in the style-preserving residual block of the try-on generator of the method of the present invention, with the two modulation steps marked by two dashed boxes;
FIG. 6 is a flow chart of the method of the present invention as applied to a virtual fitting scene;
FIG. 7 is a graph comparing the results of the present invention with the results of the current HR-VITON method when a first wearer is trying on a coat;
FIG. 8 is a graph comparing the results of the present invention with the results of the current HR-VITON method when a second wearer is trying on a coat;
FIG. 9 is a graph comparing the results of the present invention with the results of the current HR-VITON method when a third wearer is trying on a coat.
Detailed Description
The invention will be described in detail below with reference to the drawings and the detailed description.
The invention provides an authenticity virtual try-on method based on clothing region alignment and style retention modulation, which is implemented according to the following steps as shown in figures 1-6:
Step one, preprocessing a human body image I and a clothing image C to obtain a human body segmentation map S of the person in image I, a mask M of the clothing image C, a pose map P of the person in image I and a dense pose P_d, and constructing a clothing-agnostic human body representation to obtain a preprocessing result;
Specifically, for the human body image I and the clothing image C, a human body segmentation map S of the person in image I, a mask M of the clothing image C and a pose map P of the person in image I are obtained using the human parser PGN (Instance-level human parsing via part grouping network) and the pose estimator OpenPose (Realtime multi-person 2D pose estimation using part affinity fields). The clothing region in the human body segmentation map S and in the human body image I, together with the arm region that carries sleeve-length information, are removed, and the hand regions that are difficult to reconstruct are preserved with the help of the pose map P, yielding the clothing-agnostic human body segmentation map S_a and the clothing-agnostic human body image I_a. The dense pose P_d of the person in image I is obtained using the dense pose estimator DensePose (Dense human pose estimation in the wild). Compared with the pose map P, the dense pose P_d contains more accurate human body pose information and better guides the human image generation process.
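As a concrete illustration only, the following is a minimal sketch of how the clothing-agnostic representation could be assembled from precomputed parsing and pose outputs; the label indices, file layout and the grey fill value are assumptions and not specified by the patent.

```python
# Minimal sketch (assumed label indices and I/O layout) of building the
# clothing-agnostic person representation S_a, I_a from precomputed maps.
import numpy as np
from PIL import Image

UPPER_CLOTH, LEFT_ARM, RIGHT_ARM = 5, 14, 15   # assumed PGN label ids

def clothing_agnostic(image_path, parse_path, hand_mask):
    """image: HxWx3 person image, parse: HxW PGN labels,
    hand_mask: HxW bool mask of hands derived from the OpenPose keypoints."""
    img = np.asarray(Image.open(image_path).convert("RGB")).astype(np.float32)
    parse = np.asarray(Image.open(parse_path))            # per-pixel part labels
    # Remove the garment region and the arm regions (they encode sleeve length),
    # but keep the hard-to-reconstruct hand pixels indicated by the pose keypoints.
    remove = np.isin(parse, [UPPER_CLOTH, LEFT_ARM, RIGHT_ARM]) & ~hand_mask
    S_a = np.where(remove, 0, parse)                      # clothing-agnostic parsing
    I_a = img.copy()
    I_a[remove] = 128.0                                   # grey out removed pixels
    return S_a, I_a
```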
Step two, constructing a retention modulation virtual try-on model based on a clothing region pair Ji Hefeng grid, and designing a loss function;
in the second step, a virtual try-on model for keeping modulation based on the clothing region pair Ji Hefeng grid is composed of a condition generator and a try-on generator; in the condition generator stage, two sets of information are used as input, one set is clothing image C and mask M of clothing image C, and the other set is clothing-independent human body segmentation map S a And dense gesture P D . Deformation garment C capable of outputting body posture w Mask M w And a generated target semantic segmentation mapIn the stage of fitting generator, inputting human body image I irrelevant to clothes a Deformation garment C matching human body posture w Dense posture P D And the generated target semantic segmentation map +.>Generating final try-on result image under multi-stage guidance +.>And takes it as output.
The condition generator includes two encoders E 1 and E2 And a decoder and a condition alignment processing module which are formed by the multistage feature fusion module. The processing procedure of the condition generator stage is as follows: first by two encoders E 1 and E2 Extracting multi-stage features of two groups of information input respectively, forming a feature pyramid by using output features of a coding layer, and combining two encoders E 1 and E2 The final stage of characteristics obtained respectively and />Splicing on the channels, inputting a convolution layer, and upsampling to obtain a first-level apparent stream ++>Will->Upsampling by residual block to obtain the original partition map feature +.>Will->As an initial clothing feature->First-level appearance stream->Initial segmentation map feature->Initial clothing features->Inputting the video stream and the segmentation map features into a decoder consisting of a multi-level feature fusion module, realizing information exchange between the video stream and the segmentation map features, gradually refining the video stream and the segmentation map features in the multi-level decoding process, and outputting a final video stream by the decoder>And final segmentation map feature->Will->As an initial target segmentation map->Use of the final appearance stream->Obtaining the initial deformed clothing C w,raw Mask M corresponding thereto w,raw An initial target segmentation map +.>Initial deformation garment C w,raw Mask M corresponding thereto w,raw An input condition alignment processing module for performing alignment processing including dividing the image from the initial object division map +.>Remove misaligned areas from the original deformed garment C w,raw Removing body areas that would be covered by arms, hair, etc. over the garment in the final try-on image; deformed garment C capable of matching human body posture w Mask M w And the generated target semantic segmentation map +.>The loss function used by the condition generator stage is +.>The specific expression is as follows:
wherein ,the loss of L1, the loss of perception, the loss of regularization of the appearance flow, the loss of least square countermeasure, the loss of standard pixel cross entropy and the loss of alignment of the clothing area are respectively; lambda (lambda) L1 、λ VGG 、λ SM 、λ cGAN 、λ CE and λAL Respectively indicate->Corresponding super parameters; respectively set as lambda L1 =λ CE =λ AL =10,λ SM =2,λ VGG =λ cGAN =1。
The L1 loss L_L1 and the perceptual loss L_VGG encourage the garment deformation process to conform to the figure's pose while preserving garment characteristics. They are defined as

L_L1 = Σ_i w_i·||W(M, F_w^i) − S_c||_1 and L_VGG = Σ_i w_i·φ(W(C, F_w^i), I_c)

where w_i determines the relative importance of each term, F_w^i denotes the i-th level appearance flow, S_c is the segmentation map of the clothing region in the human body image I, I_c is the clothing region in the human body image I, φ computes the difference between the VGG feature maps of its two inputs, W(M, F_w^i) denotes warping the mask M of the clothing image C with the i-th level appearance flow, and likewise W(C, F_w^i) denotes warping the clothing image C with the i-th level appearance flow.
The appearance-flow regularization loss L_SM forces the appearance flow to be smooth, and the least-squares adversarial loss L_cGAN is used to generate a more realistic segmentation map; compared with the original GAN loss, the least-squares adversarial loss helps generate high-quality results. In their definitions, F_w^4 denotes the 4th (i.e., last) level appearance flow, D denotes the discriminator and X denotes the input of the generator.
The standard pixel-wise cross-entropy loss L_CE is used to improve the quality of the generated segmentation map and is defined as

L_CE = −(1/(H_s·W_s)) Σ_{k=1..C_s} Σ_{y=1..H_s} Σ_{x=1..W_s} S_{k,y,x}·log Ŝ_{k,y,x}

where H_s, W_s and C_s denote the height, width and number of channels of the human body segmentation map S, and S_{k,y,x} and Ŝ_{k,y,x} denote the pixel values at coordinates (x, y) in channel k of the human body segmentation map S and of the generated target semantic segmentation map Ŝ, respectively.
The clothing region alignment loss is designed by the present invention. It is defined as the L1 norm of the difference between the mask M_w of the deformed garment C_w matching the human body pose and the clothing channel Ŝ_cloth of the generated target semantic segmentation map Ŝ:

L_AL = ||M_w − Ŝ_cloth||_1

By constraining the gap between the mask M_w of the deformed garment C_w and the clothing channel Ŝ_cloth of the generated target semantic segmentation map Ŝ, this loss minimizes the generation of misaligned clothing regions and thus resolves the blurred, intermixed skin and garment boundaries that such regions cause. At the same time, it alleviates the erroneous segmentation maps produced by over-fitting when the model is trained on paired images.
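For illustration, a minimal PyTorch-style sketch of this alignment loss is given below; the tensor layout (the garment mask and the clothing channel of the generated segmentation as N x 1 x H x W tensors) and the batch-mean reduction are assumptions.

```python
# Minimal sketch of the clothing region alignment loss L_AL = || M_w - S_hat_cloth ||_1.
# Shapes are assumptions: both inputs are (N, 1, H, W) with values in [0, 1].
import torch

def clothing_region_alignment_loss(warped_mask: torch.Tensor,
                                   seg_cloth_channel: torch.Tensor) -> torch.Tensor:
    """warped_mask: mask M_w of the deformed garment matching the body pose.
    seg_cloth_channel: clothing channel of the generated target segmentation map."""
    # L1 norm of the per-pixel difference, averaged over the batch.
    return (warped_mask - seg_cloth_channel).abs().flatten(1).sum(dim=1).mean()
```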
The try-on generator consists of multiple groups of residual block plus upsampling layer structures. The try-on generator stage proceeds as follows: the clothing-agnostic human body image I_a, the deformed garment C_w matching the human body pose and the dense pose P_d serve as the inputs of the first group of residual blocks, and these inputs are collectively referred to as the try-on components. Meanwhile, the try-on components are resized to match the feature map output by each residual block plus upsampling structure and concatenated with it, and the concatenation result serves as the input feature map of the next residual block, so that the features are refined at multiple scales. Modulation parameters generated from the target semantic segmentation map Ŝ modulate the feature map entering each residual block, guiding the try-on image generation process. The present invention constructs a style-preserving residual block based on Style-Preserved Modulation (SPM). In this block, the input feature map (the input of the first group of residual blocks, or the concatenation produced by a residual block plus upsampling; see the input feature map F_i in FIG. 4) is processed along two paths: the main path passes sequentially through two network layers each consisting of SPM modulation, a ReLU activation layer and a convolution layer, while the branch passes through one identical network layer and is then added to the main path to obtain the output features (see the output features F_o in FIG. 4). SPM modulation is divided into two steps: the first step aims to integrate the context style and the semantic layout, and the second step aims to inject the fused information into the feature map.
In the first modulation step, two kinds of parameters are generated: four semantic modulation parameters and two context modulation parameters. The semantic modulation parameters comprise two groups of scale and shift parameters. The context modulation parameters (γ_c, β_c) are generated from the original feature map without passing through a normalization layer, so that the un-normalized feature map retains more of the context style. For the i-th layer feature map F_i, the first modulation step fuses these parameters to produce the fused modulation parameters γ_f and β_f.
In the second modulation step, the normalized feature map is modulated with the fused modulation parameters to obtain the modulated features F̂_i = γ_f ⊙ Norm(F_i) + β_f, where Norm(·) denotes the normalization layer and ⊙ denotes element-wise multiplication.
The loss function of the try-on generator stage is designed as a weighted sum of three terms, with the specific expression:

λ_VGG^I·L_VGG^I + λ_cGAN^I·L_cGAN^I + λ_FM·L_FM

where L_VGG^I, L_cGAN^I and L_FM are respectively the try-on image perceptual loss, the adversarial loss and the feature matching loss, and λ_VGG^I, λ_cGAN^I and λ_FM denote the corresponding hyperparameters.
The try-on image perceptual loss L_VGG^I constrains the feature-level similarity between the generated try-on result Î and the human body image I, encouraging the generator to produce try-on results that are semantically close to the real image. It is expressed as

L_VGG^I = Σ_{i=1..V} (1/R_i)·||V^(i)(Î) − V^(i)(I)||_1

where V is the number of layers used in the VGG network, and V^(i) and R_i are respectively the feature map extracted by the i-th layer of the VGG network and its number of elements.
The adversarial loss L_cGAN^I and the feature matching loss L_FM are used to generate realistic data at multiple scales and to stabilize adversarial training. The feature matching loss is expressed as

L_FM = E Σ_{i=1..T_I} (1/K_i)·||D_I^(i)(Î) − D_I^(i)(I)||_1

where D_I is the discriminator, T_I is the number of layers of D_I, and D_I^(i) and K_i are respectively the activation of the i-th layer of D_I and its number of elements.
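A minimal PyTorch-style sketch of these three terms is shown below; the VGG layer split, the discriminator interface, the assumption that inputs are already normalized for VGG, and the absence of loss weights are all illustrative choices, not the patent's implementation.

```python
# Minimal sketch of the try-on generator losses: VGG perceptual loss, least-squares
# adversarial loss and discriminator feature matching. The VGG layer split and the
# discriminator interface are assumptions for illustration only.
import torch
import torch.nn as nn
from torchvision.models import vgg19

class TryOnLosses(nn.Module):
    def __init__(self):
        super().__init__()
        feats = vgg19(weights="IMAGENET1K_V1").features.eval()
        # Split VGG into a few slices whose outputs serve as perceptual features.
        self.slices = nn.ModuleList([feats[:4], feats[4:9], feats[9:18], feats[18:27]])
        for p in self.parameters():
            p.requires_grad_(False)

    def perceptual(self, fake, real):
        # Inputs assumed already normalized for VGG; 1/R_i is folded into mean().
        loss, x, y = 0.0, fake, real
        for sl in self.slices:
            x, y = sl(x), sl(y)
            loss = loss + torch.abs(x - y).mean()
        return loss

    @staticmethod
    def adversarial(disc_fake_logits):
        # Least-squares GAN generator term: push fake logits toward 1.
        return ((disc_fake_logits - 1.0) ** 2).mean()

    @staticmethod
    def feature_matching(fake_feats, real_feats):
        # L1 distance between discriminator activations of fake and real images.
        return sum(torch.abs(f - r).mean() for f, r in zip(fake_feats, real_feats))
```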
Step three, training the virtual try-on model based on clothing region alignment and style retention modulation with the paired image training set and the preprocessing results obtained from it through step one, continuously optimizing the model weights through back propagation. After training is completed, the finally obtained weights are stored.
In step three, the paired image training set consists of multiple person images and the corresponding clothing images worn by those persons, and during training these image pairs and the preprocessing results obtained through step one are used. Using the clothing-agnostic human body representation obtained by preprocessing prevents the generalization ability of the model from being impaired. During training, an Adam optimizer is used to optimize the parameters in the back-propagation process; the condition generator is trained for 150,000 iterations and the try-on generator for 100,000 iterations.
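As an illustration, a sketch of the corresponding two-stage optimizer setup is given below; the learning rates, batch sizes and iteration counts are the ones reported in Example 1, while the Adam betas and the decay schedule details are assumptions.

```python
# Minimal sketch of the two-stage training setup, using the learning rates, batch
# sizes and iteration counts reported in Example 1; model/dataloader construction,
# Adam betas and the decay schedule are assumptions.
import torch

def make_stage(generator, discriminator, lr_g, lr_d):
    opt_g = torch.optim.Adam(generator.parameters(), lr=lr_g, betas=(0.5, 0.999))
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=lr_d, betas=(0.5, 0.999))
    return opt_g, opt_d

# Stage 1: condition generator, 150,000 iterations, batch size 8.
# opt_g1, opt_d1 = make_stage(cond_gen, cond_disc, lr_g=2e-4, lr_d=2e-4)
# Stage 2: try-on generator, 100,000 iterations, batch size 2,
# with the learning rate gradually decayed after 80,000 iterations.
# opt_g2, opt_d2 = make_stage(tryon_gen, tryon_disc, lr_g=1e-4, lr_d=4e-4)
```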
Step four, the human body image of the wearer and the image of the garment to be tried on are processed through step one, the obtained preprocessing result is input into the virtual try-on model based on clothing region alignment and style retention modulation, and a realistic try-on image is obtained from the weights learned in training, thereby realizing the virtual try-on function.
In other words, the preprocessed human body image of the wearer and the image of the garment to be tried on are fed into the model to obtain the try-on result image.
Example 1
Case analysis and method verification
To verify the effectiveness of the invention, the model was implemented in a PyTorch environment and trained on a single NVIDIA Tesla V100 32 GB GPU. The high-resolution virtual try-on dataset VITON-HD was adopted. The dataset contains 13,679 image pairs of front-view women and their corresponding upper garments, at a resolution of 1024 x 768 pixels. These image pairs are divided into a training set of 11,647 pairs and a test set of 2,032 pairs. The learning rate of the first-stage generator was set to 0.0002 and that of its discriminator to 0.0002; the learning rate of the second-stage generator was set to 0.0001 and that of its discriminator to 0.0004. The first-stage batch size was set to 8 with 150,000 training iterations, and the second-stage batch size was set to 2 with 100,000 training iterations, with the learning rate gradually decayed after 80,000 iterations. Quantitative comparisons were made by computing Structural Similarity (SSIM), Learned Perceptual Image Patch Similarity (LPIPS), Fréchet Inception Distance (FID), Kernel Inception Distance (KID) and Inception Score (IS); the method outperforms the state-of-the-art methods on all metrics, demonstrating that the invention obtains try-on results closer to reality on the virtual try-on task.
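For reference, the sketch below shows how such metrics are typically computed with a common library; torchmetrics is an assumed choice, and the image tensor layout (float RGB in [0, 1]) and subset size are illustrative, not part of the patent.

```python
# Minimal sketch of the quantitative evaluation; torchmetrics is an assumed library
# choice, and images are assumed to be float RGB in [0, 1] with shape (N, 3, H, W).
import torch
from torchmetrics.image import StructuralSimilarityIndexMeasure, LearnedPerceptualImagePatchSimilarity
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance
from torchmetrics.image.inception import InceptionScore

def evaluate(fake: torch.Tensor, real: torch.Tensor):
    ssim = StructuralSimilarityIndexMeasure(data_range=1.0)(fake, real)
    lpips = LearnedPerceptualImagePatchSimilarity(net_type="vgg", normalize=True)(fake, real)
    fid = FrechetInceptionDistance(normalize=True)
    fid.update(real, real=True); fid.update(fake, real=False)
    kid = KernelInceptionDistance(subset_size=50, normalize=True)
    kid.update(real, real=True); kid.update(fake, real=False)
    inception = InceptionScore(normalize=True)
    inception.update(fake)
    return {"SSIM": ssim.item(), "LPIPS": lpips.item(),
            "FID": fid.compute().item(), "KID": kid.compute()[0].item(),
            "IS": inception.compute()[0].item()}
```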
The comparison results are shown in Table 1.
Table 1 results comparison table
The results of the application on the virtual try-on task are shown in fig. 7-9.
FIG. 7 compares the result of the present invention with that of the state-of-the-art HR-VITON method when a first wearer tries on an upper garment. It shows the human body image of the wearer, the image of the garment to be tried on, the try-on image produced by HR-VITON and the try-on image produced by the present method. The dashed boxes mark the unnatural regions produced by HR-VITON, including an erroneously generated arm region and an unnatural neckline boundary. The present invention obtains a more realistic try-on result.
FIG. 8 compares the result of the present invention with that of the current HR-VITON method when a second wearer tries on an upper garment. It shows the human body image of the wearer, the image of the garment to be tried on, the try-on image produced by HR-VITON and the try-on image produced by the present method. The dashed boxes mark the unnatural regions produced by HR-VITON, including a blurred, intermixed skin and collar boundary and a garment lacking natural folds or creasing at the shoulder and elbow. The present invention obtains a more realistic try-on result.
FIG. 9 compares the result of the present invention with that of the current HR-VITON method when a third wearer tries on an upper garment. It shows the human body image of the wearer, the image of the garment to be tried on, the try-on image produced by HR-VITON and the try-on image produced by the present method. The dashed box marks the unnatural region produced by HR-VITON, namely blurred neck features. The present invention obtains a more realistic try-on result.
The method of the invention generates convincing try-on results, with more realistic and natural outcomes in skin region generation, collar presentation and garment wearing details, which benefit from the clothing region alignment loss and the design of the context-style-preserving generator. In summary, the invention obtains realistic try-on images with clear skin and garment boundaries, which demonstrates the applicability of the method.
Example 2
The authenticity virtual try-on method based on clothing region alignment and style retention modulation is implemented according to the following steps:
step one, preprocessing a human body image I and a clothing image C to obtain a human body segmentation map S of the person in image I, a mask M of the clothing image C, a pose map P of the person in image I and a dense pose P_d, and constructing a clothing-agnostic human body representation to obtain a preprocessing result;
step two, constructing a virtual try-on model based on clothing region alignment and style retention modulation, and designing its loss functions;
step three, training the virtual try-on model based on clothing region alignment and style retention modulation with a paired image training set and the preprocessing results obtained from it through step one, continuously optimizing the model weights through back propagation; after training is completed, storing the finally obtained weights;
step four, processing the human body image of the wearer and the image of the garment to be tried on through step one, inputting the obtained preprocessing result into the virtual try-on model based on clothing region alignment and style retention modulation, and obtaining a realistic try-on image from the trained weights, thereby realizing the virtual try-on function.
Example 3
The authenticity virtual try-on method based on clothing region alignment and style retention modulation is implemented according to the following steps:
step one, preprocessing a human body image I and a clothing image C to obtain a human body segmentation map S of the person in image I, a mask M of the clothing image C, a pose map P of the person in image I and a dense pose P_d, and constructing a clothing-agnostic human body representation to obtain a preprocessing result;
step two, constructing a virtual try-on model based on clothing region alignment and style retention modulation, and designing its loss functions;
step three, training the virtual try-on model based on clothing region alignment and style retention modulation with a paired image training set and the preprocessing results obtained from it through step one, continuously optimizing the model weights through back propagation; after training is completed, storing the finally obtained weights;
step four, processing the human body image of the wearer and the image of the garment to be tried on through step one, inputting the obtained preprocessing result into the virtual try-on model based on clothing region alignment and style retention modulation, and obtaining a realistic try-on image from the trained weights, thereby realizing the virtual try-on function.
The first step is as follows:
For the human body image I and the clothing image C, a human body segmentation map S of the person in image I, a mask M of the clothing image C and a pose map P of the person in image I are obtained using the human parser PGN and the pose estimator OpenPose. The clothing region in the human body segmentation map S and in the human body image I, together with the arm region that carries sleeve-length information, are removed, and the hand regions that are difficult to reconstruct are preserved with the help of the pose map P, yielding the clothing-agnostic human body segmentation map S_a and the clothing-agnostic human body image I_a. The dense pose P_d of the person in image I is obtained using the dense pose estimator DensePose. Compared with the pose map P, the dense pose P_d contains more accurate human body pose information and better guides the human image generation process.

Claims (10)

1. An authenticity virtual fitting method based on clothing region alignment and style retention modulation, characterized by comprising the following steps: step one, preprocessing a human body image I and a clothing image C to obtain a human body segmentation map S of the person in image I, a mask M of the clothing image C, a pose map P of the person in image I and a dense pose P_d, and obtaining a preprocessing result; step two, constructing a virtual try-on model based on clothing region alignment and style retention modulation, and designing its loss functions; step three, training the virtual try-on model based on clothing region alignment and style retention modulation with a paired image training set and the preprocessing results obtained from it through step one, continuously optimizing the model weights through back propagation, and storing the finally obtained weights after training is completed; step four, processing the human body image of the wearer and the image of the garment to be tried on through step one, inputting the obtained preprocessing result into the virtual try-on model based on clothing region alignment and style retention modulation, and obtaining a realistic try-on image from the trained weights.
2. The authenticity virtual fitting method based on clothing region alignment and style retention modulation according to claim 1, wherein step one specifically comprises:
for the human body image I and the clothing image C, obtaining a human body segmentation map S of the person in image I, a mask M of the clothing image C and a pose map P of the person in image I using the human parser PGN and the pose estimator OpenPose; removing the clothing region in the human body segmentation map S and in the human body image I together with the arm region that carries sleeve-length information, and preserving the hard-to-reconstruct hand regions with the help of the pose map P, to obtain the clothing-agnostic human body segmentation map S_a and the clothing-agnostic human body image I_a; obtaining the dense pose P_d of the person in image I using the dense pose estimator DensePose; compared with the pose map P, the dense pose P_d contains more accurate human body pose information and better guides the human image generation process.
3. The authenticity virtual fitting method based on clothing region alignment and style retention modulation according to claim 1, wherein in step two the virtual try-on model based on clothing region alignment and style retention modulation consists of a condition generator and a try-on generator; the condition generator stage takes two groups of information as input: one group is the clothing image C and its mask M, and the other group is the clothing-agnostic human body segmentation map S_a and the dense pose P_d; it outputs the deformed garment C_w matching the human body pose, its mask M_w and the generated target semantic segmentation map Ŝ; the try-on generator stage takes as input the clothing-agnostic human body image I_a, the deformed garment C_w matching the human body pose, the dense pose P_d and the generated target semantic segmentation map Ŝ, and under this multi-level guidance generates the final try-on result image Î as output.
4. The authenticity virtual fitting method based on clothing region alignment and style retention modulation according to claim 3, wherein in step two the condition generator comprises two encoders E_1 and E_2, a decoder composed of multi-level feature fusion modules, and the condition alignment processing module; the condition generator stage proceeds as follows: first, the two encoders E_1 and E_2 extract multi-level features from the two groups of inputs, and the output features of each encoding layer form feature pyramids; the final-level features obtained by E_1 and E_2 are concatenated along the channel dimension, passed through a convolution layer and upsampled to obtain the first-level appearance flow; the final-level segmentation-branch feature is upsampled through a residual block to obtain the initial segmentation-map feature, and the final-level clothing-branch feature serves as the initial clothing feature; the first-level appearance flow, the initial segmentation-map feature and the initial clothing feature are input into the decoder composed of multi-level feature fusion modules, which exchange information between the appearance flow and the segmentation-map features and progressively refine both over the multi-level decoding process; the decoder outputs the final appearance flow and the final segmentation-map feature, which is taken as the initial target segmentation map; using the final appearance flow, the initially deformed garment C_w,raw and its corresponding mask M_w,raw are obtained; the initial target segmentation map, the initially deformed garment C_w,raw and its mask M_w,raw are input into the condition alignment processing module, whose alignment processing includes removing misaligned regions from the initial target segmentation map and removing from the initially deformed garment C_w,raw the body regions that will be covered by arms, hair and the like over the garment in the final try-on image; this yields the deformed garment C_w matching the human body pose, its mask M_w and the generated target semantic segmentation map Ŝ.
5. The authenticity virtual fitting method based on clothing region alignment and style retention modulation according to claim 4, wherein in step two the loss function used in the condition generator stage is a weighted sum of six terms, with the specific expression:

λ_L1·L_L1 + λ_VGG·L_VGG + λ_SM·L_SM + λ_cGAN·L_cGAN + λ_CE·L_CE + λ_AL·L_AL

where L_L1, L_VGG, L_SM, L_cGAN, L_CE and L_AL are respectively the L1 loss, the perceptual loss, the appearance-flow regularization loss, the least-squares adversarial loss, the standard pixel-wise cross-entropy loss and the clothing region alignment loss, and λ_L1, λ_VGG, λ_SM, λ_cGAN, λ_CE and λ_AL are the corresponding hyperparameters, set as λ_L1 = λ_CE = λ_AL = 10, λ_SM = 2, λ_VGG = λ_cGAN = 1;
the L1 loss L_L1 and the perceptual loss L_VGG are defined as

L_L1 = Σ_i w_i·||W(M, F_w^i) − S_c||_1 and L_VGG = Σ_i w_i·φ(W(C, F_w^i), I_c)

where w_i determines the relative importance of each term, F_w^i denotes the i-th level appearance flow, S_c is the segmentation map of the clothing region in the human body image I, I_c is the clothing region in the human body image I, φ computes the difference between the VGG feature maps of its two inputs, W(M, F_w^i) denotes warping the mask M of the clothing image C with the i-th level appearance flow, and likewise W(C, F_w^i) denotes warping the clothing image C with the i-th level appearance flow;
the appearance-flow regularization loss L_SM constrains the appearance flow to be smooth, and the least-squares adversarial loss L_cGAN drives the generation of a realistic segmentation map; in their definitions, F_w^4 denotes the 4th (i.e., last) level appearance flow, D denotes the discriminator and X denotes the input of the generator;
the standard pixel-wise cross-entropy loss L_CE is defined as

L_CE = −(1/(H_s·W_s)) Σ_{k=1..C_s} Σ_{y=1..H_s} Σ_{x=1..W_s} S_{k,y,x}·log Ŝ_{k,y,x}

where H_s, W_s and C_s denote the height, width and number of channels of the human body segmentation map S, and S_{k,y,x} and Ŝ_{k,y,x} denote the pixel values at coordinates (x, y) in channel k of the human body segmentation map S and of the generated target semantic segmentation map Ŝ, respectively;
the clothing region alignment loss L_AL is defined as the L1 norm of the difference between the mask M_w of the deformed garment C_w matching the human body pose and the clothing channel Ŝ_cloth of the generated target semantic segmentation map Ŝ:

L_AL = ||M_w − Ŝ_cloth||_1.
6. The authenticity virtual fitting method based on clothing region alignment and style retention modulation according to claim 3, wherein in step two the try-on generator consists of multiple groups of residual block plus upsampling layer structures; the try-on generator stage proceeds as follows: the clothing-agnostic human body image I_a, the deformed garment C_w matching the human body pose and the dense pose P_d serve as the inputs of the first group of residual blocks, and these inputs are collectively referred to as the try-on components; meanwhile, the try-on components are resized to match the feature map output by each residual block plus upsampling structure and concatenated with it, and the concatenation result serves as the input feature map of the next residual block, so that the features are refined at multiple scales; modulation parameters generated from the target semantic segmentation map Ŝ modulate the feature map entering each residual block, guiding the try-on image generation process; within a residual block, the input feature map (the input of the first group of residual blocks, or the concatenation produced by a residual block plus upsampling) is processed along two paths: the main path passes sequentially through two network layers each consisting of SPM modulation, a ReLU activation layer and a convolution layer, while the branch passes through one identical network layer and is then added to the main path to obtain the output features.
7. The authenticity virtual fitting method based on clothing region alignment and style retention modulation according to claim 6, wherein in step two the SPM modulation of the feature map entering a residual block with the modulation parameters generated from the target semantic segmentation map Ŝ is divided into two steps: the first step aims to integrate the context style and the semantic layout, and the second step aims to inject the fused information into the feature map;
in the first modulation step, two kinds of parameters are generated: four semantic modulation parameters and two context modulation parameters; the semantic modulation parameters comprise two groups of scale and shift parameters; the context modulation parameters (γ_c, β_c) are generated from the original feature map without passing through a normalization layer, so that the un-normalized feature map retains more of the context style; for the i-th layer feature map F_i, the first modulation step fuses these parameters to produce the fused modulation parameters γ_f and β_f;
in the second modulation step, the normalized feature map F_i is modulated with the fused modulation parameters to obtain the modulated features F̂_i = γ_f ⊙ Norm(F_i) + β_f, where Norm(·) denotes the normalization layer and ⊙ denotes element-wise multiplication.
8. The authenticity virtual fitting method based on clothing region alignment and style retention modulation according to claim 6, wherein in step two the loss function of the try-on generator stage is designed as a weighted sum of three terms, with the specific expression:

λ_VGG^I·L_VGG^I + λ_cGAN^I·L_cGAN^I + λ_FM·L_FM

where L_VGG^I, L_cGAN^I and L_FM are respectively the try-on image perceptual loss, the adversarial loss and the feature matching loss, and λ_VGG^I, λ_cGAN^I and λ_FM denote the corresponding hyperparameters;
the try-on image perceptual loss L_VGG^I constrains the feature-level similarity between the generated try-on result Î and the human body image I and is expressed as

L_VGG^I = Σ_{i=1..V} (1/R_i)·||V^(i)(Î) − V^(i)(I)||_1

where V is the number of layers used in the VGG network, and V^(i) and R_i are respectively the feature map extracted by the i-th layer of the VGG network and its number of elements;
the adversarial loss L_cGAN^I and the feature matching loss L_FM are used to generate realistic data at multiple scales and to stabilize adversarial training; the feature matching loss is expressed as

L_FM = E Σ_{i=1..T_I} (1/K_i)·||D_I^(i)(Î) − D_I^(i)(I)||_1

where D_I is the discriminator, T_I is the number of layers of D_I, and D_I^(i) and K_i are respectively the activation of the i-th layer of D_I and its number of elements.
9. The authenticity virtual fitting method based on clothing region alignment and style retention modulation according to claim 6, wherein in step three the paired image training set consists of multiple person images and the corresponding clothing images worn by those persons, and during training these image pairs and the preprocessing results obtained through step one are used; using the clothing-agnostic human body representation obtained by preprocessing prevents the generalization ability of the model from being impaired.
10. The authenticity virtual fitting method based on clothing region alignment and style retention modulation according to claim 9, wherein in step three an Adam optimizer is used during training to optimize the parameters in the back-propagation process; the condition generator is trained for 150,000 iterations and the try-on generator for 100,000 iterations.
CN202310901979.9A 2023-07-21 2023-07-21 Authenticity virtual fitting method based on clothing region alignment and style retention modulation Pending CN116777738A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310901979.9A CN116777738A (en) 2023-07-21 2023-07-21 Authenticity virtual fitting method based on clothing region alignment and style retention modulation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310901979.9A CN116777738A (en) 2023-07-21 2023-07-21 Authenticity virtual fitting method based on clothing region alignment and style retention modulation

Publications (1)

Publication Number Publication Date
CN116777738A true CN116777738A (en) 2023-09-19

Family

ID=87991447

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310901979.9A Pending CN116777738A (en) 2023-07-21 2023-07-21 Authenticity virtual fitting method based on clothing region alignment and style retention modulation

Country Status (1)

Country Link
CN (1) CN116777738A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117649463A (en) * 2023-10-27 2024-03-05 武汉纺织大学 Diffusion model-based try-on image generation method, device, equipment and medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination