CN116310659B - Training data set generation method and device - Google Patents

Training data set generation method and device

Info

Publication number
CN116310659B
CN116310659B
Authority
CN
China
Prior art keywords
image
hand
real
composite
synthesized
Prior art date
Legal status
Active
Application number
CN202310555942.5A
Other languages
Chinese (zh)
Other versions
CN116310659A (en)
Inventor
王威
Current Assignee
Zhongshu Yuanyu Digital Technology Shanghai Co ltd
Original Assignee
Zhongshu Yuanyu Digital Technology Shanghai Co ltd
Priority date
Filing date
Publication date
Application filed by Zhongshu Yuanyu Digital Technology Shanghai Co ltd
Priority to CN202310555942.5A
Publication of CN116310659A
Application granted
Publication of CN116310659B


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G06V40/28: Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application provide a training data set generation method and device. The method includes the following steps: photographing a first real hand to obtain a first real hand image, and generating a first composite image for a first three-dimensional virtual scene, where the first three-dimensional virtual scene includes a first background model; pasting the image of the first real hand, matted out of the first real hand image, onto the first composite image to obtain a first initial synthesized hand image; inputting the first initial synthesized hand image into the generator of a trained generative adversarial network to obtain a first target synthesized hand image; and determining, from the first target synthesized hand image and the pose of the first real hand corresponding to the first real hand image, a training data set for training a hand pose estimation algorithm. The scheme provided by the embodiments of the present application can improve the realism of synthesized hand images.

Description

Training data set generation method and device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and apparatus for generating a training data set.
Background
Gesture recognition, i.e., hand pose estimation, refers to predicting three-dimensional coordinates of key points of a hand in an image containing the hand. Due to the rapid development of deep neural networks, significant progress has been made in hand pose estimation based on two-dimensional images. However, these deep neural network-based methods are highly dependent on a large amount of training data, making training costly.
Therefore, it has been proposed to train hand pose estimation algorithms on synthetic data sets; however, a hand pose estimation algorithm trained on a synthetic data set performs poorly when applied to real data.
Disclosure of Invention
In view of the foregoing, the present application has been made to provide a method and apparatus for generating a training data set that solves or at least partially solves the foregoing problems.
Thus, in one embodiment of the present application, a method of generating a training data set is provided. The method comprises the following steps:
shooting a first real hand to obtain a first real hand image and generating a first composite image for a first three-dimensional virtual scene; the first three-dimensional virtual scene comprises a first background model;
pasting the image of the first real hand, matted out of the first real hand image, onto the first composite image to obtain a first initial synthesized hand image;
inputting the first initial synthesized hand image into the generator of a trained generative adversarial network to obtain a first target synthesized hand image;
and determining, from the first target synthesized hand image and the pose of the first real hand corresponding to the first real hand image, a training data set for training a hand pose estimation algorithm.
In yet another embodiment of the present application, an electronic device is provided. The electronic device includes: a memory and a processor, wherein,
the memory is used for storing programs;
the processor is coupled to the memory for executing the program stored in the memory to implement the method of any one of the above.
According to the technical solution provided by the embodiments of the present application, the real hand image and the virtual background image are blended by the generator of a trained generative adversarial network, so that the real hand is fused more naturally into the virtual background and the realism of the synthesized training data is improved. Moreover, in the synthesized training data the background is virtual while the hand is real. Since the goal of a hand pose estimation algorithm is to predict the hand pose, the real hand information in the training data improves the training effect of the hand pose estimation algorithm, while the virtual background information in the training data has little influence on its training. Therefore, a hand pose estimation algorithm trained with training data synthesized by the technical solution provided by the embodiments of the present application achieves a good prediction effect on real data.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a flow chart of a network training method according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a method for generating a training data set according to an embodiment of the present disclosure;
fig. 3 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
To enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments herein without inventive effort fall within the scope of protection of the present application.
Furthermore, some of the flows described in the specification, claims, and drawings of this application contain a plurality of operations that appear in a particular order, but these operations may be performed out of the order in which they appear herein or in parallel. The sequence numbers of the operations, such as 101 and 102, are merely used to distinguish the operations from one another and do not by themselves represent any order of execution. In addition, the flows may include more or fewer operations, and these operations may be performed sequentially or in parallel. It should be noted that the terms "first" and "second" herein are used to distinguish different messages, devices, modules, and the like; they do not represent an order, nor do they require that the "first" and the "second" be of different types.
It should be noted that the user information (including but not limited to user equipment information and user personal information) and data (including but not limited to data used for analysis, stored data, and displayed data) involved in the present application are information and data authorized by the user or fully authorized by all parties; the collection, use, and processing of the related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions, and corresponding operation entries are provided for users to choose to authorize or refuse.
Before describing the training data set generation method provided by the embodiments of the application, the training process of the generative adversarial network required by the method is described.
Fig. 1 is a schematic flow chart of a network training method according to an embodiment of the present application. The execution subject of the method can be a client or a server. The client may be hardware integrated on the terminal and provided with an embedded program, or may be an application software installed in the terminal, or may be a tool software embedded in an operating system of the terminal, which is not limited in this embodiment of the present application. The terminal can be any terminal equipment including a mobile phone, a tablet personal computer, a vehicle-mounted terminal equipment and the like. The server may be a common server, a cloud end, a virtual server, or the like, which is not particularly limited in the embodiment of the present application. As shown in fig. 1, the method includes:
101. A second three-dimensional virtual scene is constructed.
A second hand model and a second background model are arranged in the second three-dimensional virtual scene; the second hand model is created from a second real hand.
102. The second real hand is photographed to obtain a second real hand image, and a second composite image is generated for the second three-dimensional virtual scene.
The pose of the second real hand corresponding to the second real hand image is the same as the pose of the second hand model corresponding to the second composite image, and the image area of the second real hand in the second real hand image has the same shape as the image area of the second hand model in the second composite image.
103. The image of the second real hand, matted out of the second real hand image, is pasted onto the image area of the second hand model in the second composite image to obtain a second initial synthesized hand image.
104. The second initial synthesized hand image is determined as a training input of the generator in the generative adversarial network.
105. The illumination gradient map corresponding to the second target synthesized hand image output by the generator and the illumination gradient map corresponding to the second composite image are determined as training inputs of the illumination discriminator in the generative adversarial network.
The illumination discriminator is used to judge whether its training input is true or false; the illumination gradient map corresponding to the second target synthesized hand image is a false sample, and the illumination gradient map corresponding to the second composite image is a true sample.
106. The hand image in the second target synthesized hand image output by the generator and the real hand image in a sample real hand image are determined as training inputs of the texture discriminator in the generative adversarial network.
The texture discriminator is used to judge whether its training input is true or false; the hand image in the second target synthesized hand image is a false sample, and the real hand image in the sample real hand image is a true sample.
107. Network optimization is performed on the generative adversarial network, with the loss function satisfying a convergence condition as the optimization target.
The loss function is determined from the training output of the illumination discriminator and the training output of the texture discriminator; the generator is used to generate training data for a hand pose estimation algorithm.
In the above 101, in an example, the second hand model is created according to the second real hand, that is, the shape of the second hand model is the same as the shape of the second real hand.
In the second three-dimensional virtual scene, the pose of the second hand model may be determined by the pose of the second real hand. Specifically, in the second three-dimensional virtual scene, the pose of the second hand model may be adjusted according to the pose of the second real hand so that the pose of the second hand model is the same as the pose of the second real hand.
In 102 above, in order to obtain the second real hand image and the second composite image conveniently, in one possible implementation the second hand model and the second real hand maintain a linkage relationship, that is, the pose of the second hand model is kept consistent with the pose of the second real hand and changes as the pose of the second real hand changes. In this way, at the same moment, the second real hand image can be obtained by photographing the second real hand at a preset shooting angle relative to the second real hand, while the second composite image is generated for the second three-dimensional virtual scene at a virtual shooting angle relative to the second hand model. The preset shooting angle relative to the second real hand is the same as the virtual shooting angle relative to the second hand model; the virtual shooting angle relative to the second hand model refers to the shooting angle used by the virtual camera that captures the second composite image in the second three-dimensional virtual scene. The second real hand image and the second composite image obtained in this way satisfy the above conditions that the pose of the second hand model corresponding to the second composite image is the same as the pose of the second real hand corresponding to the second real hand image, and that the shape of the image area of the second real hand in the second real hand image is identical to the shape of the image area of the second hand model in the second composite image.
Of course, besides the above manner, the second real hand image and the second composite image may be obtained in other manners, as long as the obtained second real hand image and second composite image satisfy the conditions that the pose of the second hand model corresponding to the second composite image is the same as the pose of the second real hand corresponding to the second real hand image, and that the image area of the second real hand in the second real hand image has the same shape as the image area of the second hand model in the second composite image.
It should be noted that, the area size of the image area of the second real hand in the second real hand image and the area size of the image area of the second hand model in the second composite image may be the same or different, which is not particularly limited in the embodiment of the present application.
In 103 above, the image area of the second real hand in the second initial synthesized hand image has the same size and position as the image area of the second hand model in the second composite image.
When the image area of the second real hand in the second real hand image has the same size as the image area of the second hand model in the second composite image, the image of the second real hand matted out of the second real hand image is pasted directly onto the image area of the second hand model in the second composite image to obtain the second initial synthesized hand image.
When the two image areas differ in size, the image of the second real hand matted out of the second real hand image is scaled according to the size ratio between the image area of the second real hand in the second real hand image and the image area of the second hand model in the second composite image, yielding a scaled image of the second real hand whose shape and size match the image area of the second hand model in the second composite image; the scaled image is then pasted onto the image area of the second hand model in the second composite image to obtain the second initial synthesized hand image.
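As an illustrative sketch of the scaling-and-pasting operation described above (assuming OpenCV and NumPy, binary hand masks for both images, and an illustrative helper name; none of these details are prescribed by the patent), the second initial synthesized hand image could be composited as follows:

```python
import cv2
import numpy as np

def paste_real_hand(real_img, real_mask, synth_img, synth_mask):
    """Paste the matted real hand onto the hand-model region of the composite image.

    real_img:   H1 x W1 x 3 photograph of the real hand
    real_mask:  H1 x W1 binary matte of the real hand
    synth_img:  H2 x W2 x 3 rendered composite image
    synth_mask: H2 x W2 binary mask of the hand model in the composite image
    """
    # Bounding boxes of the two hand regions.
    ys, xs = np.where(real_mask > 0)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    Ys, Xs = np.where(synth_mask > 0)
    Y0, Y1, X0, X1 = Ys.min(), Ys.max() + 1, Xs.min(), Xs.max() + 1

    # Scale the matted hand so it matches the size of the hand-model region.
    hand = real_img[y0:y1, x0:x1]
    mask = real_mask[y0:y1, x0:x1]
    th, tw = Y1 - Y0, X1 - X0
    hand = cv2.resize(hand, (tw, th), interpolation=cv2.INTER_LINEAR)
    mask = cv2.resize(mask, (tw, th), interpolation=cv2.INTER_NEAREST)

    # Overwrite the hand-model pixels with the scaled real-hand pixels.
    out = synth_img.copy()
    region = out[Y0:Y1, X0:X1]
    region[mask > 0] = hand[mask > 0]
    out[Y0:Y1, X0:X1] = region
    return out
```

The same logic applies when the first real hand image is pasted onto the first composite image in the generation stage described later.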
Currently, hand pose estimation algorithms may be based on monocular or multi-view vision, where multi-view refers to two or more views, for example binocular or trinocular. In the multi-view scenario, the first real hand image, the first composite image, the first initial synthesized hand image, and the first target synthesized hand image are all multi-view images. Each multi-view image includes images of a plurality of view angles; the plurality of view angles includes a first view angle, and the first view angle may be any one of the plurality of view angles. For example, a binocular image includes a left view and a right view. Then, in step 103 above, pasting the image of the second real hand, matted out of the second real hand image, onto the image area of the second hand model in the second composite image to obtain the second initial synthesized hand image may specifically include:
pasting the image of the second real hand, matted out of the first-view-angle image of the second real hand image, onto the image area of the second hand model in the first-view-angle image of the second composite image, to obtain the first-view-angle image of the second initial synthesized hand image.
In 104 above, the generative adversarial network (GAN) includes a generator, an illumination discriminator, and a texture discriminator. The internal structures of the generator, the illumination discriminator, and the texture discriminator are described in detail in the following embodiments.
The second initial synthesized hand image is input to the generator in the generative adversarial network, which blends it to obtain the second target synthesized hand image output by the generator.
In 105 above, the second target synthesized hand image output by the generator may be processed to obtain a target illumination gradient map (i.e., the illumination gradient map corresponding to the second target synthesized hand image). The target illumination gradient map shows how the illumination intensity gradient varies across the second target synthesized hand image.
The second composite image may likewise be processed to obtain a reference illumination gradient map (i.e., the illumination gradient map corresponding to the second composite image), which shows how the illumination intensity gradient varies across the second composite image. The second three-dimensional virtual scene is configured with lighting conditions.
The illumination gradient map corresponding to the second target synthesized hand image output by the generator and the illumination gradient map corresponding to the second composite image are determined as training inputs of the illumination discriminator in the generative adversarial network. The illumination discriminator is used to judge whether its training input is true or false. The illumination gradient map corresponding to the second target synthesized hand image is a false sample, i.e., its training label is false; the illumination gradient map corresponding to the second composite image is a true sample, i.e., its training label is true. Note: the distribution of the illumination intensity gradient in the three-dimensional virtual scene is very close to the real situation, that is, the illumination gradient map corresponding to the second composite image is very close to reality, so it can be regarded as a real sample.
The illumination gradient map corresponding to the second target synthesized hand image output by the generator is input to the illumination discriminator in the generative adversarial network, so that the illumination discriminator estimates the probability that this illumination gradient map is true.
The illumination gradient map corresponding to the second composite image is input to the illumination discriminator in the generative adversarial network, so that the illumination discriminator estimates the probability that this illumination gradient map is true.
In 106 above, the hand image in the second target synthesized hand image is a false sample, that is, its training label is false; the real hand image in the sample real hand image is a true sample, that is, its training label is true.
The hand image in the second target synthesized hand image output by the generator is input to the texture discriminator in the generative adversarial network, so that the texture discriminator estimates the probability that this hand image is true.
The real hand image in the sample real hand image is input to the texture discriminator in the generative adversarial network, so that the texture discriminator estimates the probability that this real hand image is true.
In 107 above, the loss function of the generative adversarial network may be constructed from the training output of the illumination discriminator and the training output of the texture discriminator.
Network optimization is performed on the generative adversarial network, with the loss function satisfying a convergence condition as the optimization target.
When the loss function meets the convergence condition, training can be stopped. Note: the training of the generative adversarial network may also be referred to as adversarial training.
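For illustration, a highly simplified adversarial training step corresponding to 104 to 107 might look like the following sketch (the module and helper names, the plain binary cross-entropy losses, and the optimizer setup are all assumptions made for illustration, not the patent's actual implementation):

```python
import torch
import torch.nn.functional as F

def adversarial_step(generator, light_disc, texture_disc, g_opt, d_opt,
                     init_synth, ref_synth, real_hand_crop, hand_mask,
                     grad_map, crop_hand):
    """One training step: the generator blends the pasted hand into the scene;
    the illumination discriminator judges illumination gradient maps and the
    texture discriminator judges hand crops (real vs. generated)."""
    fake = generator(init_synth)  # second target synthesized hand image

    # --- discriminator update ---
    d_opt.zero_grad()
    real_light = light_disc(grad_map(ref_synth))      # rendered composite: true sample
    fake_light = light_disc(grad_map(fake.detach()))  # generator output: false sample
    real_tex = texture_disc(real_hand_crop)           # photographed hand: true sample
    fake_tex = texture_disc(crop_hand(fake.detach(), hand_mask))
    d_loss = (F.binary_cross_entropy_with_logits(real_light, torch.ones_like(real_light))
              + F.binary_cross_entropy_with_logits(fake_light, torch.zeros_like(fake_light))
              + F.binary_cross_entropy_with_logits(real_tex, torch.ones_like(real_tex))
              + F.binary_cross_entropy_with_logits(fake_tex, torch.zeros_like(fake_tex)))
    d_loss.backward()
    d_opt.step()

    # --- generator update: try to fool both discriminators ---
    # grad_map must be differentiable here (e.g. a blur convolution plus Sobel filters).
    g_opt.zero_grad()
    g_light = light_disc(grad_map(fake))
    g_tex = texture_disc(crop_hand(fake, hand_mask))
    g_loss = (F.binary_cross_entropy_with_logits(g_light, torch.ones_like(g_light))
              + F.binary_cross_entropy_with_logits(g_tex, torch.ones_like(g_tex)))
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```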
With the network training method provided by the embodiments of the present application, the global illumination-intensity-gradient variation of the target synthesized hand image generated by the generator is made close to the real illumination-intensity-gradient variation, and the textures and colors of the hand in the target synthesized hand image are made close to those of a real hand. That is, the generator trained by this network training method can generate fairly realistic synthesized hand images, and a hand pose estimation algorithm trained on such synthesized hand images predicts well on real data. In addition, the discriminators in this training method do not need to rely on actually photographed pictures containing both real hands and real backgrounds, which reduces the cost of arranging backgrounds with various illuminations, textures, and object scales.
Optionally, the method may further include:
108. Gaussian blur processing is performed on the second target synthesized hand image output by the generator and on the second composite image, respectively, to obtain the illumination gradient map corresponding to the second target synthesized hand image and the illumination gradient map corresponding to the second composite image.
The specific implementation principle and steps of the Gaussian blur processing can be found in the prior art and are not described here.
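One plausible realization of such an illumination gradient map is sketched below (it assumes the gradient magnitude of a Gaussian-blurred grayscale image is used; the kernel size and sigma are illustrative values, not taken from the patent):

```python
import cv2
import numpy as np

def illumination_gradient_map(image_bgr, ksize=21, sigma=5.0):
    """Blur away texture detail, then take the spatial gradient magnitude of the
    remaining low-frequency brightness as an approximation of the illumination
    intensity gradient."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
    blurred = cv2.GaussianBlur(gray, (ksize, ksize), sigma)
    gx = cv2.Sobel(blurred, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(blurred, cv2.CV_32F, 0, 1, ksize=3)
    return np.sqrt(gx * gx + gy * gy)
```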
Optionally, the "generating the second composite image for the second three-dimensional virtual scene" in 102 may be implemented by:
1021. A plurality of second composite images are generated for the second three-dimensional virtual scene.
Any two of the plurality of second composite images differ in at least one of illumination, texture, and scale. Here, texture refers to the model textures in the three-dimensional virtual scene, and scale refers to the model scale sizes in the three-dimensional virtual scene.
In this way, the generative adversarial network is trained with composite images of different illuminations, different textures, and different scales, so that it can learn information about illumination, texture, scale, and the like during the training stage, which helps the generator produce target synthesized hand images with different illuminations, textures, and scales for training a hand pose estimation algorithm.
In another example, the second real hand may wear a hand ornament, and the second hand model may wear an ornament model created from that hand ornament. It should be noted that, in the subsequent image matting, the hand ornament is treated as part of the hand, that is, the matted hand image includes the image of the hand ornament. Likewise, the image area of the hand model is understood as the image area of the combination of the hand model and the ornament model worn on it.
With the technical solution provided by this embodiment of the application, the generative adversarial network can learn, during the training stage, the characteristics of a hand wearing an ornament, which helps the generator produce target synthesized hand images in which the hand wears an ornament, for training a hand pose estimation algorithm.
Optionally, the generator comprises a discrete wavelet transform module, a base generator, and an inverse discrete wavelet transform module.
The method can further comprise the following steps:
109. The discrete wavelet transform module is used to perform a discrete wavelet transform on the second initial synthesized hand image, obtaining a high-frequency map and a low-frequency map corresponding to the second initial synthesized hand image.
110. The low-frequency map is input to the base generator to obtain a target low-frequency map.
111. The inverse discrete wavelet transform module is used to perform an inverse discrete wavelet transform on the target low-frequency map and the high-frequency map, obtaining the second target synthesized hand image.
The low-frequency map is the component of the image with relatively small gray-level variation and can be called the style image; the high-frequency map is the component with relatively large gray-level variation and can be called the content image.
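A minimal sketch of this frequency split is given below (it assumes a single-level 2-D discrete wavelet transform with PyWavelets and the Haar wavelet; the wavelet family and the internal base generator are not specified by the patent and are assumptions here):

```python
import numpy as np
import pywt

def dwt_generate(init_synth_gray, base_generator):
    """Split the initial synthesized hand image into low- and high-frequency parts,
    restyle only the low-frequency ("style") part, then invert the transform."""
    low, (lh, hl, hh) = pywt.dwt2(init_synth_gray, 'haar')  # single-level 2-D DWT
    low_restyled = base_generator(low)                      # base generator acts on the style image
    # Recombine the restyled low frequencies with the untouched high-frequency content.
    return pywt.idwt2((low_restyled, (lh, hl, hh)), 'haar')

# Usage sketch with an identity "base generator", just to show the round trip.
img = np.random.rand(256, 256).astype(np.float32)
out = dwt_generate(img, base_generator=lambda low: low)
assert out.shape == img.shape
```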
In one implementation, the generator adopts a U-Net network structure, and the residual module in the U-Net network structure comprises a self-calibrated partial convolution layer. The purpose of using the self-calibrated partial convolution layer is to restrict the information flow, so that only effective information is propagated forward and information unsuitable for the scene is discarded.
After training of the generative adversarial network is finished, the trained generative adversarial network can be used to generate a training data set. Fig. 2 is a flow chart of a method for generating a training data set according to an embodiment of the present application. The execution subject of the method can be a client or a server. The client may be hardware integrated on the terminal and provided with an embedded program, or may be application software installed in the terminal, or may be tool software embedded in an operating system of the terminal, which is not limited in the embodiments of the present application. The terminal can be any terminal device, including a mobile phone, a tablet computer, a vehicle-mounted terminal device, and the like. The server may be an ordinary server, a cloud server, a virtual server, or the like, which is not particularly limited in the embodiments of the present application. As shown in fig. 2, the method includes:
201. A first real hand is photographed to obtain a first real hand image, and a first composite image is generated for a first three-dimensional virtual scene.
Wherein the first three-dimensional virtual scene comprises a first background model.
202. The image of the first real hand, matted out of the first real hand image, is pasted onto the first composite image to obtain a first initial synthesized hand image.
203. The first initial synthesized hand image is input to the generator of the trained generative adversarial network to obtain a first target synthesized hand image.
204. A training data set for training a hand pose estimation algorithm is determined from the first target synthesized hand image and the pose of the first real hand corresponding to the first real hand image.
The trained hand pose estimation algorithm may be applied to MR (Mixed Reality) glasses to estimate the hand pose of a wearing user and perform subsequent operations according to the hand pose estimation result.
In 201 above, in one example, no hand model exists in the first three-dimensional virtual scene. In this case, in the subsequent step 202, the image of the first real hand may be pasted onto any area of the first composite image to obtain the first initial synthesized hand image.
In another example, a hand model may be present in the first three-dimensional virtual scene. Specifically, a first hand model is further arranged in the first three-dimensional virtual scene; the first hand model is created from the first real hand. The pose of the first real hand corresponding to the first real hand image is the same as the pose of the first hand model corresponding to the first composite image, and the image area of the first real hand in the first real hand image has the same shape as the image area of the first hand model in the first composite image, while their sizes may or may not be the same. The pose of the first real hand corresponding to the first real hand image refers to the pose of the first real hand at the moment the first real hand image is captured; the pose of the hand model corresponding to the first composite image refers to the pose of the first hand model when the first composite image is generated. In the first three-dimensional virtual scene, the first hand model may maintain a linkage relationship with the first real hand.
The specific implementation of "photographing the first real hand to obtain the first real hand image and generating the first composite image for the first three-dimensional virtual scene" may refer to the specific implementation of "photographing the second real hand to obtain the second real hand image and generating the second composite image for the second three-dimensional virtual scene" in the above embodiments, and is not repeated here.
In 202 above, the specific implementation of "pasting the image of the first real hand, matted out of the first real hand image, onto the first composite image to obtain the first initial synthesized hand image" may refer to the specific implementation of "pasting the image of the second real hand, matted out of the second real hand image, onto the second composite image to obtain the second initial synthesized hand image" in the above embodiments, and is not repeated here.
In 203 above, the specific implementation of "inputting the first initial synthesized hand image to the generator of the trained generative adversarial network to obtain the first target synthesized hand image" may refer to the specific implementation of "inputting the second initial synthesized hand image to the generator of the generative adversarial network to obtain the second target synthesized hand image" in the above embodiments, and is not repeated here.
In 204 above, the first target synthesized hand image, together with the pose of the first real hand corresponding to the first real hand image, is used as one training sample for training the hand pose estimation algorithm.
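Putting steps 201 to 204 together, the data set generation loop might be sketched as follows (the capture iterator, the paste_real_hand helper from the earlier sketch, and the tensor conversion helpers are illustrative assumptions):

```python
import torch

def build_training_set(capture_pairs, generator, paste_real_hand, to_tensor, to_image):
    """capture_pairs yields tuples of
    (real_hand_img, real_hand_mask, real_hand_pose, composite_img, composite_hand_mask);
    the result is a list of (target synthesized hand image, hand pose label) samples."""
    samples = []
    generator.eval()
    with torch.no_grad():
        for real_img, real_mask, pose, synth_img, synth_mask in capture_pairs:
            init = paste_real_hand(real_img, real_mask, synth_img, synth_mask)  # step 202
            target = generator(to_tensor(init))                                 # step 203
            samples.append((to_image(target), pose))                            # step 204
    return samples
```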
According to the technical solution provided by the embodiments of the present application, the real hand image and the virtual background image are blended by the generator of a trained generative adversarial network, so that the real hand is fused more naturally into the virtual background and the realism of the synthesized training data is improved. Moreover, in the synthesized training data the background is virtual while the hand is real. Since the goal of a hand pose estimation algorithm is to predict the hand pose, the real hand information in the training data improves the training effect of the hand pose estimation algorithm, while the virtual background information in the training data has little influence on its training. Therefore, a hand pose estimation algorithm trained with training data synthesized by the technical solution provided by the embodiments of the present application achieves a good prediction effect on real data.
Optionally, the "generating a first composite image for the first three-dimensional virtual scene" in 201 may be implemented by the following steps:
2011. A plurality of first composite images are generated for the first three-dimensional virtual scene.
Any two of the plurality of first composite images differ in at least one of illumination, texture, and scale.
Optionally, the first real hand image, the first composite image, the first initial composite hand image, and the first target composite hand image are all multi-view images. Details of the processing of the multi-view image may be referred to the corresponding content in the above embodiments, and will not be described herein.
The technical solutions provided in the embodiments of the present application will be described by way of example below:
The scheme covers three different scenarios: ambient light changes, hand texture changes, and object scale changes. The techniques required by the three scenarios are shared.
The principle and idea are briefly described as follows:
The second composite image data set is obtained by keeping the hand model and the background model generated in Blender fixed, and varying illumination, scale, texture, hand ornaments, and the like as variables.
The synthesis is mainly divided into scene-change synthesis and object-addition synthesis.
The main logic of scene change synthesis is as follows:
Binocular images and depth maps generated from a three-dimensional virtual scene in the three-dimensional rendering software Blender, covering different illuminations, different scales, and different textures, together with real hand images photographed from a real hand, are taken as the reference ground truth. A deep learning network learns the mapping between the generator's input picture and the synthesized image it generates, as well as the mapping between the depth map of the input picture and the depth map of the synthesized image. The generative adversarial network mainly uses a Pix2Pix (Image-to-Image Translation) network as the backbone. The generator is a U-Net-like architecture: the first stage is coarse synthesis, which aims to learn a rough scene layout, and the second stage is fine synthesis, which further refines the synthesis result. The discriminator follows the PatchGAN discriminator structure with four layers of progressive feature extraction of the global representation, which aims at a better perception of the global structure.
The main logic of the article adding and synthesizing is as follows:
the flow is similar to scene change, and because the adding object needs to be fused with the scene condition, the U-Net bottom residual error module adopts self-calibration part convolution construction in the course of coarse synthesis, so that the information flow is expected to be limited, only effective information is forward, and information which is not appropriate to the scene is abandoned. Meanwhile, in order to avoid the feature distribution not to be destroyed, a batch normalization layer (Batchnormalization) in a residual error module is deleted, and a cavity convolution is used for expanding a receptive field before connecting an up-sampling layer, so that the receptive field skips over a local optimal solution. In the fine synthesis, cosine similarity between a synthesized object and a synthesized object is calculated through convolution, soft maximum (softmax) processing in space and channel dimension is carried out on a similarity graph, and then the convolution is transposed on the similarity graph by using a background to realize the whole fine combination process.
In addition to the Laplacian pyramid loss, the SSIM (Structural Similarity) loss, and the VGG19 (Visual Geometry Group) perceptual loss, the loss functions of the generator and discriminators in the GAN also include a feature matching loss, whose purpose is to bring the generated image and the true image as close as possible to the feature centers of the above discriminator networks, so as to generate better images. The generative adversarial network is also trained jointly with a simple depth estimation network, which may specifically use a U-Net structure. The reason is that changing illumination or texture does not affect the depth map; therefore, the depth map estimated by the depth estimation network from the second target synthesized image generated by the generator should be as similar as possible to the depth map corresponding to the second composite image in the reference ground truth. In other words, a depth loss is determined from the difference between the depth map estimated from the generator's second target synthesized image and the depth map corresponding to the second composite image in the reference ground truth, and the loss function of the generative adversarial network may also include this depth loss.
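The depth-consistency term described above can be sketched in a few lines (assuming an auxiliary depth estimation network depth_net and a plain L1 penalty; both choices are illustrative assumptions rather than details given by the patent):

```python
import torch.nn.functional as F

def depth_consistency_loss(depth_net, generated_img, reference_depth):
    # Changing illumination or texture should not change geometry, so the depth
    # estimated from the generated image is pulled toward the rendered reference depth.
    return F.l1_loss(depth_net(generated_img), reference_depth)
```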
The implementation flow is as follows:
step one: constructing a three-dimensional hand model and a virtual background model by using a development engine Unity3D or other three-dimensional modeling tools;
Step two: and importing the generated three-dimensional hand model and virtual background model into a Blender, obtaining specific coordinates of each model, and simultaneously converting the coordinates into a corresponding world coordinate system.
Step three: 2 virtual cameras, a left camera and a right camera, are built above the combined scene of the three-dimensional hand model and the virtual background model; the focal length, the length and width of the photosensitive element, and the baseline are set, and the corresponding camera intrinsic and extrinsic parameters are calculated. Determining the camera positions makes it convenient to subsequently convert the world coordinate system into a camera coordinate system with each camera as the origin.
Step four: the obtained Blender camera extrinsic and intrinsic parameters are converted into the intrinsic and extrinsic parameter conventions of an open-source computer vision library.
step five: the two cameras respectively output a three-dimensional virtual scene in the Blender into a two-dimensional image by using projection transformation, and the Blender is used for obtaining the distance between points on the two-dimensional image and the cameras, namely the value of a z channel (depth) of each image, so that an RGB image pair and a depth image pair generated by the left camera and the right camera are obtained;
step six: the depth map generated by the method is converted into a parallax map, and the main formula is as follows:
P = f·B / (d - (x - y))    (1)
where P is the depth, d is the disparity (parallax), B is the baseline length of the two cameras, f is the camera focal length in pixel units, and x and y are the column coordinates of the principal points of the left and right views, respectively.
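Under the assumption that formula (1) is the standard stereo relation reconstructed above, the depth-to-disparity conversion of step six can be sketched as:

```python
import numpy as np

def depth_to_disparity(depth, baseline, focal_px, cx_left, cx_right, eps=1e-6):
    """Convert a depth map P to a disparity map d via d = f*B / P + (x - y),
    where x and y are the principal-point column coordinates of the two views."""
    return focal_px * baseline / np.maximum(depth, eps) + (cx_left - cx_right)
```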
Step seven: set the shooting trajectories of the two cameras, and determine the pose changes of the hand model in Blender by marking 21 preset key points on the palm and photographing the actual hand motion with a real camera. Sensing patches are attached to the 21 preset key points of the real hand, the pose of the real hand is determined from these sensing patches, and the pose of the hand model in Blender is then adjusted according to the pose of the real hand.
Step eight: change the illumination (color temperature 2500 K, 3500 K, 4500 K, 5500 K, or 6500 K; illumination direction N, NE, E, SE, S, SW, W, or NW), the model texture (male hand texture, female hand texture, child hand texture), and/or the model scale (by scaling the model) in the three-dimensional virtual scene in Blender, or change the real hand, or the hand ornament worn by the real hand, and repeat the above operations to obtain a binocular hand reference data set containing a large number of depth, disparity, and RGB image pairs. Note: N denotes north, S south, W west, E east, NE northeast, NW northwest, SE southeast, and SW southwest.
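The sweep over the variables listed in step eight could be scripted roughly as follows (a sketch; the scale factors and the render_pair routine that would set the Blender scene parameters and render the stereo pair are placeholders, not real Blender API calls):

```python
from itertools import product

COLOR_TEMPS_K = [2500, 3500, 4500, 5500, 6500]
LIGHT_DIRECTIONS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
HAND_TEXTURES = ["male", "female", "child"]
MODEL_SCALES = [0.8, 0.9, 1.0, 1.1, 1.2]   # illustrative scale factors

def enumerate_render_configs():
    """Yield every illumination / texture / scale combination for the reference set."""
    for temp, direction, texture, scale in product(
            COLOR_TEMPS_K, LIGHT_DIRECTIONS, HAND_TEXTURES, MODEL_SCALES):
        yield {"color_temperature": temp, "light_direction": direction,
               "hand_texture": texture, "model_scale": scale}

# Usage sketch: hand each configuration to a (hypothetical) Blender rendering routine.
# for cfg in enumerate_render_configs():
#     render_pair(cfg)   # placeholder: set scene parameters, render the RGB + depth stereo pair
```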
Step nine: the binocular hand reference data set is coarsely combined by the generator of the conditional GAN to produce a rough scene synthesized image, a further refined result is output by the fine-combination stage, and the illumination discriminator and texture discriminator of the conditional GAN judge whether the generated image conforms to the real image distribution. The loss function is:
L = L_cGAN + L_FM + L_SSIM + L_lap + L_VGG    (2)
where L_cGAN is the loss function of the conditional GAN, L_FM is the feature matching loss, L_SSIM is the SSIM loss, L_lap is the Laplacian pyramid loss, and L_VGG is the VGG19 loss.
Step ten: the whole model training may use a cosine annealing schedule; specifically, the Adam optimizer may be adopted with an initial learning rate of 0.001, and the evaluation metrics are SSIM, PSNR (Peak Signal-to-Noise Ratio), and MSE (Mean Square Error).
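In PyTorch terms, step ten might be configured roughly as follows (a sketch; the number of epochs is illustrative, while Adam, the 0.001 initial learning rate, and cosine annealing come from the text):

```python
import torch

def make_optimizer(generator, epochs=200):
    """Adam with an initial learning rate of 0.001 and a cosine-annealing schedule."""
    opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    return opt, sched

# Evaluation metrics: SSIM and PSNR (e.g. skimage.metrics.structural_similarity and
# peak_signal_noise_ratio) plus plain mean squared error.
```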
Technical effects targeted by the present application: by permuting and combining real hand images and virtual backgrounds with variables such as illumination, scale, and texture, there is no data gap between the generated images and real images, and training on the synthesized data achieves an effect close to that of a real data set. At the same time, the method avoids the customized equipment and the large amount of manpower and material resources required to build a comparable real data set. Because synthesis is combined with reality, variables such as brightness, scale, texture, and occlusion can be controlled, so that a model trained with the synthetic data set is robust to these variables. Finally, the method fuses the real hand well with the virtual environment, so the generated images have more natural hand poses, avoiding the unrealistic deformation and distortion of hand poses in composite images generated from a gesture library.
Fig. 3 shows a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 3, the electronic device includes a memory 1101 and a processor 1102. The memory 1101 may be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device. The memory 1101 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
The memory 1101 is configured to store a program;
the processor 1102 is coupled to the memory 1101, and is configured to execute the program stored in the memory 1101, so as to implement the methods provided in the above method embodiments.
Further, as shown in fig. 3, the electronic device further includes: communication component 1103, display 1104, power component 1105, audio component 1106, and other components. Only some of the components are schematically shown in fig. 3, which does not mean that the electronic device only comprises the components shown in fig. 3.
Accordingly, the present application also provides a computer-readable storage medium storing a computer program, where the computer program is capable of implementing the steps or functions of the method provided by each method embodiment.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the solution without inventive effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or of course by means of hardware. Based on such understanding, the above technical solution, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a ROM (Read-Only Memory)/RAM (Random Access Memory), a magnetic disk, or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting thereof; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (9)

1. A method of generating a training data set, comprising:
shooting a first real hand to obtain a first real hand image and generating a first composite image for a first three-dimensional virtual scene; the first three-dimensional virtual scene comprises a first background model;
pasting the image of the first real hand, matted out of the first real hand image, onto the first composite image to obtain a first initial synthesized hand image;
inputting the first initial synthesized hand image into the generator of a trained generative adversarial network to obtain a first target synthesized hand image;
determining, from the first target synthesized hand image and the pose of the first real hand corresponding to the first real hand image, a training data set for training a hand pose estimation algorithm;
constructing a second three-dimensional virtual scene, wherein a second hand model and a second background model are arranged in the second three-dimensional virtual scene, and the second hand model is created from a second real hand;
photographing the second real hand to obtain a second real hand image, and generating a second composite image for the second three-dimensional virtual scene, wherein the pose of the second hand model corresponding to the second composite image is the same as the pose of the second real hand corresponding to the second real hand image, and the image area of the second real hand in the second real hand image has the same shape as the image area of the second hand model in the second composite image;
pasting the image of the second real hand, matted out of the second real hand image, onto the image area of the second hand model in the second composite image to obtain a second initial synthesized hand image;
determining the second initial synthesized hand image as a training input of the generator in the generative adversarial network;
determining the illumination gradient map corresponding to the second target synthesized hand image output by the generator and the illumination gradient map corresponding to the second composite image as training inputs of an illumination discriminator in the generative adversarial network, wherein the illumination discriminator is used to judge whether its training input is true or false, the illumination gradient map corresponding to the second target synthesized hand image is a false sample, and the illumination gradient map corresponding to the second composite image is a true sample;
determining the hand image in the second target synthesized hand image output by the generator and the real hand image in a sample real hand image as training inputs of a texture discriminator in the generative adversarial network, wherein the texture discriminator is used to judge whether its training input is true or false, the hand image in the second target synthesized hand image is a false sample, and the real hand image in the sample real hand image is a true sample;
and performing network optimization on the generative adversarial network with the loss function satisfying a convergence condition as the optimization target, wherein the loss function is determined from the training output of the illumination discriminator and the training output of the texture discriminator.
2. The method as recited in claim 1, further comprising:
performing Gaussian blur processing on the second target synthesized hand image output by the generator and on the second composite image, respectively, to obtain the illumination gradient map corresponding to the second target synthesized hand image and the illumination gradient map corresponding to the second composite image.
3. The method of claim 1, wherein generating a second composite image for the second three-dimensional virtual scene comprises:
Generating a plurality of second composite images for the second three-dimensional virtual scene;
at least one of illumination, texture and scale corresponding to any two second composite images in the plurality of second composite images is different.
4. The method of claim 1, wherein the second real hand wears a hand ornament, and the second hand model wears an ornament model created from the hand ornament.
5. The method according to any one of claims 1 to 4, wherein the generator comprises a discrete wavelet transform module.
6. The method according to any one of claims 1 to 4, wherein the generator comprises a U-Net network structure, and a residual module in the U-Net network structure comprises a self-calibrated partial convolution layer.
7. The method according to any one of claims 1 to 4, wherein a first hand model is further arranged in the first three-dimensional virtual scene; the first hand model is created from the first real hand; the pose of the first real hand corresponding to the first real hand image is the same as the pose of the hand model corresponding to the first composite image; and the image area of the first real hand in the first real hand image has the same shape as the image area of the first hand model in the first composite image;
pasting the image of the first real hand, matted out of the first real hand image, onto the first composite image to obtain the first initial synthesized hand image comprises:
pasting the image of the first real hand, matted out of the first real hand image, onto the image area of the first hand model in the first composite image to obtain the first initial synthesized hand image.
8. The method of any one of claims 1 to 4, wherein the first real hand image, the first composite image, the first initial composite hand image, and the first target composite hand image are all multi-view images.
9. An electronic device, comprising: a memory and a processor, wherein,
the memory is configured to store a program; and
the processor, coupled to the memory, is configured to execute the program stored in the memory to implement the method of any one of claims 1 to 8.
CN202310555942.5A 2023-05-17 2023-05-17 Training data set generation method and device Active CN116310659B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310555942.5A CN116310659B (en) 2023-05-17 2023-05-17 Training data set generation method and device

Publications (2)

Publication Number Publication Date
CN116310659A (en) 2023-06-23
CN116310659B (en) 2023-08-08

Family

ID=86826186

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310555942.5A Active CN116310659B (en) 2023-05-17 2023-05-17 Training data set generation method and device

Country Status (1)

Country Link
CN (1) CN116310659B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6950104B1 (en) * 2000-08-30 2005-09-27 Microsoft Corporation Methods and systems for animating facial features, and methods and systems for expression transformation
WO2019056000A1 (en) * 2017-09-18 2019-03-21 Board Of Trustees Of Michigan State University Disentangled representation learning generative adversarial network for pose-invariant face recognition
CN111819568B (en) * 2018-06-01 2024-07-09 华为技术有限公司 Face rotation image generation method and device
US11257272B2 (en) * 2019-04-25 2022-02-22 Lucid VR, Inc. Generating synthetic image data for machine learning
US11769056B2 (en) * 2019-12-30 2023-09-26 Affectiva, Inc. Synthetic data for neural network training using vectors

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113748444A (en) * 2019-05-24 2021-12-03 腾讯美国有限责任公司 Enhancing reliable training data for hand pose estimation using a cyclic consensus antagonism network
CN110427799A (en) * 2019-06-12 2019-11-08 中国地质大学(武汉) Based on the manpower depth image data Enhancement Method for generating confrontation network
CN113906476A (en) * 2019-09-23 2022-01-07 腾讯美国有限责任公司 Method and device for synthesizing real hand image based on hybrid generation countermeasure network
CN110998663A (en) * 2019-11-22 2020-04-10 驭势(上海)汽车科技有限公司 Image generation method of simulation scene, electronic device and storage medium
CN111783525A (en) * 2020-05-20 2020-10-16 中国人民解放军93114部队 Aerial photographic image target sample generation method based on style migration
RU2764144C1 (en) * 2020-07-27 2022-01-13 Самсунг Электроникс Ко., Лтд. Rapid two-layer neural network synthesis of realistic images of a neural avatar based on a single image
WO2022236175A1 (en) * 2021-05-07 2022-11-10 Northeastern University Infant 2d pose estimation and posture detection system
WO2023069085A1 (en) * 2021-10-20 2023-04-27 Innopeak Technology, Inc. Systems and methods for hand image synthesis
CN114926324A (en) * 2022-05-27 2022-08-19 深圳市皓丽软件有限公司 Virtual fitting model training method based on real character image, virtual fitting method, device and equipment
CN114842121A (en) * 2022-06-30 2022-08-02 北京百度网讯科技有限公司 Method, device, equipment and medium for generating mapping model training and mapping
CN116109798A (en) * 2023-04-04 2023-05-12 腾讯科技(深圳)有限公司 Image data processing method, device, equipment and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Cao Yangjie et al. "A Survey of Generative Adversarial Networks and Their Computer Vision Applications." Journal of Image and Graphics, 2018, full text. *

Similar Documents

Publication Publication Date Title
Zhou et al. Sparsefusion: Distilling view-conditioned diffusion for 3d reconstruction
CN108510573B (en) Multi-view face three-dimensional model reconstruction method based on deep learning
Wang et al. Nerf-sr: High quality neural radiance fields using supersampling
CN109859098B (en) Face image fusion method and device, computer equipment and readable storage medium
CN115082639B (en) Image generation method, device, electronic equipment and storage medium
CN110378838B (en) Variable-view-angle image generation method and device, storage medium and electronic equipment
CN110223370B (en) Method for generating complete human texture map from single-view picture
KR100681320B1 (en) Method for modelling three dimensional shape of objects using level set solutions on partial difference equation derived from helmholtz reciprocity condition
Cheng et al. Zero-shot image super-resolution with depth guided internal degradation learning
US20190392632A1 (en) Method and apparatus for reconstructing three-dimensional model of object
CN111626951B (en) Image shadow elimination method based on content perception information
CN115298708A (en) Multi-view neural human body rendering
CN115239861A (en) Face data enhancement method and device, computer equipment and storage medium
CN111402412A (en) Data acquisition method and device, equipment and storage medium
CN114863037A (en) Single-mobile-phone-based human body three-dimensional modeling data acquisition and reconstruction method and system
CN114996814A (en) Furniture design system based on deep learning and three-dimensional reconstruction
CN115546505A (en) Unsupervised monocular image depth estimation method based on deep learning
CN114820945A (en) Sparse sampling-based method and system for generating image from ring shot image to any viewpoint image
CN115984447A (en) Image rendering method, device, equipment and medium
CN114881841A (en) Image generation method and device
Li et al. Effective data-driven technology for efficient vision-based outdoor industrial systems
CN109785429A (en) A kind of method and apparatus of three-dimensional reconstruction
CN116681839B (en) Live three-dimensional target reconstruction and singulation method based on improved NeRF
Fu et al. Image stitching techniques applied to plane or 3-D models: a review
CN116310659B (en) Training data set generation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant