CN116188695A - Construction method of three-dimensional hand gesture model and three-dimensional hand gesture estimation method - Google Patents

Construction method of three-dimensional hand gesture model and three-dimensional hand gesture estimation method

Info

Publication number
CN116188695A
Authority
CN
China
Prior art keywords
dimensional
image
hand
anchor point
hand gesture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310194731.3A
Other languages
Chinese (zh)
Inventor
肖阳
姜昌龙
吴存霖
郑璟泓
曹治国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN202310194731.3A priority Critical patent/CN116188695A/en
Publication of CN116188695A publication Critical patent/CN116188695A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Graphics (AREA)
  • Human Computer Interaction (AREA)
  • Geometry (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a construction method of a three-dimensional hand gesture model and a three-dimensional hand gesture estimation method, belonging to the technical field of three-dimensional gesture estimation and comprising the following steps: cutting an RGB image carrying interacting-hand information, setting uniformly distributed three-dimensional anchor points on the cut image, and obtaining the three-dimensional coordinate information of each three-dimensional anchor point; inputting the image with the three-dimensional anchor points into a feature enhancement model to obtain the three-dimensional offset and weight from each three-dimensional anchor point to all corresponding estimated joint points; fusing the three-dimensional offsets, their weights and the three-dimensional coordinate information to obtain the three-dimensional coordinates of the hand joint points, and then determining a loss function; and training a three-dimensional hand gesture estimation network with the loss function to obtain the target three-dimensional hand gesture model. By constructing interactivity among the local three-dimensional anchor points, the model gains global spatial and contextual perception, so that the joint information of the joint points to be estimated is better captured and the self-occlusion and mutual-occlusion problems of interacting hands are reduced.

Description

Construction method of three-dimensional hand gesture model and three-dimensional hand gesture estimation method
Technical Field
The invention belongs to the technical field of image posture estimation, and particularly relates to a construction method of a three-dimensional hand posture model and a three-dimensional hand posture estimation method.
Background
Many three-dimensional hand pose estimation methods are currently available. They can be broadly divided into the following two types.
The first type is three-dimensional hand pose estimation based on depth images. A common approach extracts features from the depth map and then obtains the coordinate position of each joint point by regression. In regression-based methods, a neural network predicts the three-dimensional hand information directly; encoder-decoder structures generate a probability density function, called a heat map, for each hand joint point, and the coordinate of the predicted joint point is taken from the position of highest probability in the heat map. Alternatively, a graph convolutional neural network extracts information from the image, a two-dimensional convolutional neural network produces three branches on the original depth map, namely the weight of each three-dimensional anchor point and its plane offset and depth offset to the other joint points, and the three-dimensional pose estimation result of the human body is obtained by weighted summation over the three-dimensional anchor points.
The second type is three-dimensional hand pose estimation based on RGB images. Predicting the three-dimensional joint positions of the hand from an RGB image is challenging because the RGB image lacks three-dimensional information. Network structures such as PoseNet segment the hand region from the image, perform pose estimation, and fit or map the result to three-dimensional space to infer its position. To address hand occlusion, multi-view methods have been used for hand pose estimation; they are effective for most view angles, but do not work well for a single hand when it is severely occluded. Although current methods can mitigate the occlusion and viewing-angle variations inherent to RGB images, they are limited to a single hand: multiple hand targets in an image, or a hand interacting with other objects, are difficult for them to identify.
However, the task of interacting-hand pose estimation from RGB images still faces many difficulties, which can be broadly divided into two aspects. First, in hand-interaction tasks, occlusion brings many challenges to the prediction of the interacting hands. Second, the corresponding joint-point features of the left and right hands are very similar, which makes it difficult to distinguish the features of each hand's joint points. Existing methods therefore struggle with poor prediction under occlusion, slow running speed, and an inability to effectively distinguish the hand joint points.
Disclosure of Invention
In view of the above defects or improvement needs of the prior art, the invention provides a construction method of a three-dimensional hand gesture model and a three-dimensional hand gesture estimation method. By constructing interactivity between local three-dimensional anchor points, the three-dimensional hand gesture model gains global spatial and contextual perception, so that the joint information of the joint points to be estimated is better captured and the self-occlusion and mutual occlusion that frequently occur between interacting hands are weakened, thereby solving the technical problem in the prior art that occlusion leads to low accuracy of three-dimensional hand gesture estimation.
In order to achieve the above object, according to one aspect of the present invention, there is provided a method for constructing a three-dimensional hand gesture model, including:
s1: cutting the RGB image carrying the interactive hand information to obtain a first hand image;
s2: setting a plurality of uniformly distributed three-dimensional anchor points on the first hand image to obtain a second hand image, and obtaining three-dimensional coordinate information of each three-dimensional anchor point in space;
s3: inputting the second hand image into a feature enhancement model for feature extraction and feature enhancement to obtain the three-dimensional offset and weight from each three-dimensional anchor point to all corresponding estimated joint points;
s4: fusing the three-dimensional offset of each three-dimensional anchor point to the corresponding estimated joint point, the weight and the three-dimensional coordinate information of the three-dimensional anchor point to obtain the three-dimensional coordinate of the hand joint point corresponding to each three-dimensional anchor point;
s5: determining a loss function according to the three-dimensional coordinates of each hand joint point;
s6: and inputting the training sample into a three-dimensional hand gesture estimation network to be trained to train, and adjusting network parameters by using the loss function in the training process to finally obtain a trained three-dimensional hand gesture model.
In one embodiment, the S2 includes:
uniformly dividing the first hand image into N image blocks, setting a three-dimensional anchor point at the center of each image block and acquiring the plane coordinates of each image block;
setting a depth value respectively in front of and behind the depth value of the center in a world coordinate system by taking the depth value of the root node coordinate of the hand in each image block as the center, and obtaining three depth coordinates of the three-dimensional anchor point;
and fusing the plane coordinates of the N three-dimensional anchor points and the corresponding three depth coordinates to obtain 3N three-dimensional coordinate information corresponding to the N three-dimensional anchor points.
In one embodiment, the S4 includes:
the three-dimensional coordinates of the estimated joint point j corresponding to the three-dimensional anchor points are expressed as a plane coordinate S_j^{xy} and a depth coordinate position S_j^{d}, obtained by weighted fusion over the anchor set A:

S_j^{xy} = Σ_{a∈A} W'_j(a)·(P_a^{xy} + O_j^{xy}(a))

S_j^{d} = Σ_{a∈A} W'_j(a)·(P_a^{d} + O_j^{d}(a))

wherein a∈A; P_a^{xy} and P_a^{d} respectively represent the plane coordinate and the depth coordinate position of the three-dimensional anchor point a; O_j^{xy}(a) and O_j^{d}(a) respectively represent the plane coordinate offset and the depth coordinate offset from the three-dimensional anchor point a to the corresponding estimated joint point j; W'_j(a), the normalized weight from the three-dimensional anchor point a to the corresponding estimated joint point j, represents its contribution value.
In one embodiment, the normalized weight is obtained by applying a Soft-max to the predicted weights:

W'_j(a) = exp(W_j(a)) / Σ_{b∈A} exp(W_j(b))

wherein W_j(a) represents the weight from the three-dimensional anchor point a to the corresponding estimated joint point j.
In one embodiment, the feature enhancement model includes:
the multi-scale feature extractor is used for extracting features of the second hand image through a ResNet-50 network structure, selecting the output features of a three-layer network structure as a multi-scale feature pyramid, fusing them to obtain the first image feature, processing the first image feature into features of the same plane size using a convolution layer and a group normalization layer, and flattening and concatenating these to obtain the second image feature;
the feature enhancement module is used for inputting the spatial position codes of the second image features and the three-dimensional anchor points into a plurality of coding layers for enhancement to obtain third image features;
the three-dimensional anchor point feature interaction module is used for extracting three-dimensional anchor point features in the second hand image through a three-dimensional anchor point estimator to obtain a fourth image feature, and inputting the fourth image feature and the third image feature into a plurality of decoding layers for decoding so as to output a fifth image feature;
an offset-weight prediction module for feeding the fifth image feature into two multi-layer perceptrons (MLPs), one MLP predicting the coordinate offset O_j(a) and the other MLP predicting the weight W_j(a).
In one embodiment, each decoding layer contains a self-attention structure and a cross-attention structure, and the weights of the MLPs in all decoding layers are shared;
for the self-attention structure, the input values are: Q = D + P_q, K = D + P_q, V = D, wherein D represents the fifth image feature and P_q is the spatial position code of the three-dimensional anchor points, denoted P_q = MLP(PE(a_q));
for the cross-attention structure, the input values are: Q = D + P_q, K = a_q, V = E, wherein K indicates that the q three-dimensional anchor coordinate positions are selected as the reference points in the deformable attention model, and E represents the third image feature;
all the decoding layers extract the third image features corresponding to each three-dimensional anchor point and establish interaction between the three-dimensional anchor points.
In one embodiment, the loss function includes a first loss function and a second loss function; the step S5 comprises the following steps:
generating a first loss function according to the three-dimensional coordinates and the true value information of each hand joint point, and supervising the estimation result of the obtained three-dimensional hand gesture as a first supervision signal;
and generating a second loss function according to the three-dimensional coordinates of each hand joint point and the weight corresponding to each three-dimensional anchor point, and taking the second loss function as a second supervision signal to supervise the distributed weight of the three-dimensional anchor point.
In one embodiment, the first loss function is:

loss_1 = α·Σ_{j∈J} L_{τ1}(S_j^{xy} − P_j^{xy*}) + Σ_{j∈J} L_{τ2}(S_j^{d} − P_j^{d*})

wherein P_j^{xy*} and S_j^{xy} respectively represent the true value and the predicted value of the plane coordinate of the j-th joint point, P_j^{d*} and S_j^{d} respectively represent the true value and the predicted value of the depth coordinate of the j-th joint point, α represents a given weight, and L_τ(·) represents a smooth L1 loss;

the second loss function is:

loss_2 = α·Σ_{j∈J} L_{τ1}(Σ_{a∈A} W'_j(a)·P_a^{xy} − P_j^{xy*}) + Σ_{j∈J} L_{τ2}(Σ_{a∈A} W'_j(a)·P_a^{d} − P_j^{d*})

wherein τ_1 and τ_2 are given as 1 and 3, respectively, to calculate a smoothed depth loss value;

the loss function is: loss = λ_1·loss_1 + λ_2·loss_2, wherein λ_1 and λ_2 are given hyper-parameters.
According to another aspect of the present invention, there is provided a three-dimensional hand gesture estimation method, including: inputting the image to be identified into the trained three-dimensional hand gesture model to obtain a three-dimensional hand gesture estimation result of the target object; the image to be identified carries hand interaction information of the target object.
According to another aspect of the invention, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the method.
In general, compared with the prior art, the above technical solutions conceived by the present invention can achieve the following beneficial effects:
(1) The invention provides a construction method of a three-dimensional hand gesture model. By constructing interactivity among local three-dimensional anchor points, the model gains global spatial and contextual perception, so that the joint information of the joint points to be estimated is better captured and the self-occlusion and mutual-occlusion problems of interacting hands are reduced. Meanwhile, because each three-dimensional anchor point estimates a three-dimensional offset to all the joint points to be estimated, the model can learn, through the weights, which three-dimensional anchor points contribute more to a given joint point and assign them larger weights. Finally, because the three-dimensional anchor points are placed in three-dimensional space, the depth values in hand gesture estimation are predicted more accurately.
(2) The method addresses the poor accuracy and slow running speed of existing hand gesture estimation methods, effectively handles the strong occlusion and strong feature similarity of interacting hand gestures, and, after training on large-scale data sets, has the beneficial effects of strong robustness and good generalization.
Drawings
Fig. 1 is a flow chart of a method for constructing a three-dimensional hand gesture model according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a method for constructing a three-dimensional hand gesture model according to an embodiment of the present invention.
Fig. 3 is an information flow diagram of a method provided by an embodiment of the invention during training and testing phases.
Fig. 4 is an arrangement of coding layers proposed in a method according to an embodiment of the present invention.
Fig. 5 is a setup of a decoding layer proposed in a method according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of the process of setting three-dimensional anchor points and predicting joint point coordinates through the three-dimensional anchor points according to a method provided by an embodiment of the present invention.
Fig. 7 is a schematic diagram of three-dimensional pose estimation results and three-dimensional anchor point weights of different depth layers under different interactive hand image inputs according to a method provided by an embodiment of the present invention.
Fig. 8 is a schematic diagram of the weights of three-dimensional anchor points of different joint points on different depth layers according to a method provided by an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Example 1
As shown in fig. 1, the present invention provides a method for constructing a three-dimensional hand gesture model, including:
s1: cutting the RGB image carrying the interactive hand information to obtain a first hand image;
s2: setting a plurality of uniformly distributed three-dimensional anchor points on the first hand image to obtain a second hand image, and obtaining three-dimensional coordinate information of each three-dimensional anchor point in space;
s3: inputting the second hand image into a feature enhancement model for feature extraction and feature enhancement to obtain three-dimensional offset and weight of each three-dimensional anchor point to all corresponding estimated joint points;
s4: fusing the three-dimensional offset of each three-dimensional anchor point to the corresponding estimated joint point, the weight and the three-dimensional coordinate information thereof to obtain the three-dimensional coordinate of the hand joint point corresponding to each three-dimensional anchor point;
s5: determining a loss function according to the three-dimensional coordinates of each hand joint point;
s6: and inputting the training sample into a three-dimensional hand gesture estimation network to be trained to train, and adjusting network parameters by using a loss function in the training process to finally obtain a trained three-dimensional hand gesture model.
Fig. 2 is a schematic diagram of a method for constructing a three-dimensional hand gesture model according to an embodiment of the present invention. The three-dimensional hand gesture estimation task is to predict the three-dimensional spatial position information of the hand joint points in a given depth image or RGB image; unlike single-hand gesture estimation, the gesture estimation task for interacting hands needs to predict the positions of the joint points of both hands in three-dimensional space at the same time. Aiming at the problems of similar hand joint features, severe inter-hand occlusion and self-occlusion existing in interacting-hand three-dimensional gesture estimation, the construction method provided by the invention adopts local three-dimensional anchor points: the details of the hand joint points are estimated accurately through the local three-dimensional anchor points, while an attention mechanism lets the model capture the relations among the global hand joint points in the image and acquire hand semantic information, thereby completing the three-dimensional gesture estimation of the interacting hands.
In this embodiment, unlike modeling methods that fit a hand parametric model, the algorithm can estimate the three-dimensional hand joint points with a neural network from a single RGB image, without using any prior from a hand parametric model. The information flow during the training phase and the testing phase is shown in fig. 3. For the three-dimensional hand gesture estimation task, each local three-dimensional anchor point acts as a local regressor: it regresses the three-dimensional offsets to all hand joint points, and the three-dimensional coordinate values of all hand joint points are then obtained by weighted fusion.
Example 2
S2, setting uniformly and densely distributed three-dimensional anchor points in an image, and obtaining three-dimensional coordinate information of the three-dimensional anchor points in a space, wherein the method specifically comprises the following steps:
for an input 256×256 image, a step length of 16 is selected, the image is uniformly divided into 256 image blocks, a three-dimensional anchor point is set at the center of each image block, and the plane coordinates of the 256 three-dimensional anchor points are generated. Taking the depth value of the hand root joint point coordinate as the center, depth values are set 10 cm in front of and 10 cm behind the central depth value, giving three depth values in total from which the depth coordinates of the three-dimensional anchor points are obtained. Fusing these coordinates yields 256×3 three-dimensional anchor points uniformly distributed over the image plane and over the planes in front of and behind it. The plane coordinate of each generated three-dimensional anchor point is expressed in the image coordinate system and its depth coordinate in the world coordinate system. The coordinate positions of the three-dimensional anchor points are saved when they are generated, so that they can be used later when predicting the joint point coordinates.
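The anchor layout of this example can be sketched as follows. This is a hypothetical NumPy illustration written for this description (the function name and the handling of metric units are assumptions, not part of the invention): a 256×256 crop with step length 16 gives a 16×16 grid of 256 block centres, and each centre is replicated on three depth planes at −10 cm, 0 and +10 cm around the hand root depth.

```python
import numpy as np

def build_3d_anchors(img_size=256, stride=16, root_depth=0.0, depth_offset=0.10):
    centres = np.arange(stride // 2, img_size, stride)            # 16 block-centre coordinates
    xs, ys = np.meshgrid(centres, centres)
    plane = np.stack([xs.ravel(), ys.ravel()], axis=1)            # (256, 2) plane coordinates
    depths = root_depth + np.array([-depth_offset, 0.0, depth_offset])  # three depth planes (m)
    anchors = np.concatenate(
        [np.hstack([plane, np.full((len(plane), 1), d)]) for d in depths], axis=0)
    return anchors                                                # 256 x 3 three-dimensional anchors

print(build_3d_anchors().shape)   # (768, 3)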
Example 3
For each three-dimensional anchor point a, its plane coordinate position and depth coordinate position are denoted P_a^{xy} and P_a^{d}. The plane coordinate offset and the depth coordinate offset from the three-dimensional anchor point to a predicted joint point j, obtained by the above method, are denoted O_j^{xy}(a) and O_j^{d}(a), and the normalized weight of the three-dimensional anchor point for the predicted joint point, representing its contribution value, is denoted W'_j(a). The plane coordinate position S_j^{xy} and the depth coordinate position S_j^{d} of the predicted joint point j are then obtained by weighted fusion:

S_j^{xy} = Σ_{a∈A} W'_j(a)·(P_a^{xy} + O_j^{xy}(a))

S_j^{d} = Σ_{a∈A} W'_j(a)·(P_a^{d} + O_j^{d}(a))
Example 4
The weight W_j(a) from the three-dimensional anchor point a to the joint point j is normalized and weighted using Soft-max, with the formula:

W'_j(a) = exp(W_j(a)) / Σ_{b∈A} exp(W_j(b))

wherein W'_j(a) represents the normalized weight of W_j(a).
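The weighting of Examples 3 and 4 can be illustrated with a short sketch. The snippet below is a minimal PyTorch-style illustration written for this description (tensor shapes, the function name and the use of torch are assumptions, not part of the invention): it applies the Soft-max normalization above and then forms the weighted sum of anchor position plus offset for every joint point.

```python
import torch

def fuse_anchor_predictions(anchor_xy, anchor_d, offset_xy, offset_d, weights):
    """Weighted anchor-to-joint fusion.
    anchor_xy: (A, 2)    plane coordinates P_a^{xy} of the A three-dimensional anchor points
    anchor_d:  (A,)      depth coordinates P_a^{d}
    offset_xy: (A, J, 2) plane offsets O_j^{xy}(a) to each of the J joint points
    offset_d:  (A, J)    depth offsets O_j^{d}(a)
    weights:   (A, J)    raw weights W_j(a)
    """
    w = torch.softmax(weights, dim=0)                            # normalize over anchors -> W'_j(a)
    joint_xy = (w.unsqueeze(-1) * (anchor_xy.unsqueeze(1) + offset_xy)).sum(dim=0)  # (J, 2)
    joint_d = (w * (anchor_d.unsqueeze(1) + offset_d)).sum(dim=0)                   # (J,)
    return joint_xy, joint_d
```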
Example 5
S3 comprises the following steps:
a1, inputting the RGB image into a characteristic enhancement model, and extracting and enhancing the characteristics of the image.
A2, obtaining the weight and the offset of the three-dimensional anchor point.
And A3, fusing the coordinate positions of the three-dimensional anchor points to obtain the three-dimensional coordinates of the hand joint points.
As shown in fig. 4 and fig. 5, in A1 multi-scale feature extraction is first performed on the cut RGB image (the second hand image): image features are extracted through a ResNet-50 network structure, the output features of a three-layer network structure are selected as a multi-scale feature pyramid, and the multi-layer features are fused to obtain the input features. The Encoder structure and the Decoder structure are then used to further extract and enhance the image features. In A2, the predicted weights and offsets are obtained through two multi-layer perceptron (Multilayer Perceptron, MLP) structures. As shown in fig. 6, in A3 the three-dimensional coordinates of the hand joint points are obtained by fusing the position coordinates of the three-dimensional anchor points.
Specifically, the input fused feature pyramid (the first image feature) is processed into features of the same plane size using a convolution layer and a Group-Normalization layer. These are then flattened and concatenated to obtain the second image feature, and the generated second image feature is added to the position code to obtain the third image feature. The formula of the position code is: P_xy = PE(x, y), where the PE operation encodes the position information into the feature by a sinusoidal transformation, and x, y represent the coordinate position in the feature map.
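The projection, flattening and position coding described above can be sketched as follows. This is an assumed PyTorch reading of the step, not the patent's code: "same plane size" is simplified here to a common channel width before flattening, and the sinusoidal PE(x, y) is one conventional implementation of the position code.

```python
import torch
import torch.nn as nn

def sine_pe(h, w, dim=256, temperature=10000.0):
    """2D sinusoidal position code P_xy = PE(x, y); half the channels encode y, half encode x."""
    y, x = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    freqs = temperature ** (torch.arange(dim // 4).float() * 4.0 / dim)
    def enc(c):
        t = c.flatten().float()[:, None] / freqs[None, :]
        return torch.cat([t.sin(), t.cos()], dim=1)               # (h*w, dim/2)
    return torch.cat([enc(y), enc(x)], dim=1)                     # (h*w, dim)

class FlattenWithPE(nn.Module):
    def __init__(self, in_channels=(512, 1024, 2048), dim=256):
        super().__init__()
        self.proj = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(c, dim, 1), nn.GroupNorm(32, dim)) for c in in_channels])

    def forward(self, pyramid):                                   # list of (B, C_i, H_i, W_i) maps
        tokens = []
        for proj, feat in zip(self.proj, pyramid):
            x = proj(feat)                                        # convolution + group normalization
            pe = sine_pe(x.shape[2], x.shape[3], x.shape[1]).to(x)
            tokens.append(x.flatten(2).transpose(1, 2) + pe)      # flatten, then add position code
        return torch.cat(tokens, dim=1)                           # concatenated token sequence
```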
The third image feature is input into a multi-scale Deformable Attention model, whose inputs are:

Q = F + P_xy, K = ref(F), V = F,

where F represents the input image feature and the ref operation represents the selected reference points.
Through this multi-scale Deformable Attention model, the image features are enhanced; a total of 6 encoder layers is used here.
For the enhanced image features, the three-dimensional anchor point features in the image are extracted through a three-dimensional anchor point estimator to obtain the fourth image feature, and feature fusion is carried out on the set three-dimensional anchor points.
The three-dimensional anchor point features in the image are extracted through the three-dimensional anchor point estimator, where an anchor is defined as a_q = (x_q, y_q, d_q); a_q represents the q-th three-dimensional anchor point, and x_q, y_q, d_q respectively represent its plane coordinate values and depth value. For each three-dimensional anchor point, its spatial position coding feature is expressed as: P_q = MLP(PE(a_q)).
The features output by the Decoder structure are input into two MLP structures: one MLP structure is used to predict the weight W_j(a), representing the weight value from the a-th three-dimensional anchor point to the j-th joint point; the other MLP structure is used to predict the coordinate offset O_j(a), representing the offset from the a-th three-dimensional anchor point to the j-th joint point. The offset O_j(a) is then split into O_j^{xy}(a) and O_j^{d}(a), which respectively represent the plane offset and the depth offset from the a-th three-dimensional anchor point to the j-th joint point.
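A minimal sketch of the two MLP heads in this example is given below (a PyTorch module written for this description; the hidden width and the joint count of 42, i.e. 21 joint points per hand, are assumptions): one head outputs W_j(a) for every joint point and the other outputs O_j(a), which is split into its plane and depth parts.

```python
import torch
import torch.nn as nn

class OffsetWeightHead(nn.Module):
    def __init__(self, feat_dim=256, num_joints=42, hidden=256):
        super().__init__()
        self.weight_mlp = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                        nn.Linear(hidden, num_joints))       # W_j(a)
        self.offset_mlp = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                        nn.Linear(hidden, num_joints * 3))   # O_j(a) = (dx, dy, dd)

    def forward(self, anchor_feat):                    # anchor_feat: (A, feat_dim) decoder output
        w = self.weight_mlp(anchor_feat)               # (A, J) weights to every joint point
        o = self.offset_mlp(anchor_feat).view(anchor_feat.size(0), -1, 3)
        return w, o[..., :2], o[..., 2]                # weights, plane offsets, depth offsets
```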
Example 6
Here MLP denotes a multi-layer perceptron; in the 6-layer decoding structure, each decoder layer contains such an MLP, and the MLP weights are shared across all 6 decoder layers.
Each decoder layer contains a self-attention structure and a cross-attention structure. For the self-attention structure, the values input into the multi-scale Deformable Attention model are: Q = D + P_q, K = D + P_q, V = D, where D represents the decoder output feature. For the cross-attention structure, the values input into the multi-scale Deformable Attention model are: Q = D + P_q, K = a_q, V = E, where K indicates that the reference points in the Deformable Attention model are selected as the q three-dimensional anchor coordinate positions, and E represents the output of the encoder structure. Finally, through the 6-layer decoder structure, the model extracts the feature corresponding to each three-dimensional anchor point and establishes interaction between the three-dimensional anchor points, i.e., establishes the relation between the three-dimensional anchor point features.
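The data flow of one decoder layer can be sketched as below. This is a simplified stand-in written for this description: ordinary multi-head attention replaces the multi-scale Deformable Attention of the embodiment, so only the query/key/value arrangement Q = D + P_q, K = D + P_q, V = D for self-attention and the cross-attention of the anchor queries against the encoder output E is illustrated, not the reference-point sampling itself.

```python
import torch
import torch.nn as nn

class AnchorDecoderLayer(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, d, p_q, enc_feat):
        # self-attention over anchor queries: Q = K = D + P_q, V = D
        q = d + p_q
        d = self.norm1(d + self.self_attn(q, q, d, need_weights=False)[0])
        # cross-attention: queries D + P_q attend to the encoder output E
        d = self.norm2(d + self.cross_attn(d + p_q, enc_feat, enc_feat, need_weights=False)[0])
        return self.norm3(d + self.ffn(d))
```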
Example 7
In the training stage, a mainstream interacting-hand RGB data set and a given cropping frame are used to obtain the cut RGB image. The large-scale public data set used is the InterHand2.6M data set; the depth data sets NYU and HANDS2017 are used to verify the generalization of the model. The cut image containing the interacting-hand information is input as a training sample into the three-dimensional hand gesture estimation network to be trained, and a three-dimensional hand gesture estimation result is obtained. A first loss function is generated according to the generated coordinates of the three-dimensional hand joint points and the GT (ground-truth) information and used as a first supervision signal to supervise the obtained three-dimensional hand gesture estimation result; at the same time, a second loss function is generated according to the set three-dimensional anchor point coordinate positions and the weight predicted for each three-dimensional anchor point, and used as a second supervision signal to supervise the assigned three-dimensional anchor point weights.
In the test stage, the input interactive hand image is cut, the cut image is input into a three-dimensional posture estimation network based on a three-dimensional anchor point, and a three-dimensional hand posture estimation result can be obtained.
Example 8
Determining the first loss function, the second loss function and the total loss function according to the three-dimensional coordinates of the hand joint points includes the following.

The first loss function is:

loss_1 = α·Σ_{j∈J} L_{τ1}(S_j^{xy} − P_j^{xy*}) + Σ_{j∈J} L_{τ2}(S_j^{d} − P_j^{d*})

wherein P_j^{xy*} and S_j^{xy} respectively represent the GT value and the predicted value of the plane coordinate of the j-th joint point, and P_j^{d*} and S_j^{d} respectively represent the GT value and the predicted value of the depth coordinate of the j-th joint point. α represents a given weight parameter, with a given value of 0.5. L_τ(·) represents a smooth L1 loss, expressed as:

L_τ(x) = x²/(2τ), if |x| < τ; |x| − τ/2, otherwise,

wherein τ_1 = 1 and τ_2 = 3 are given, respectively, to calculate a smoothed depth loss value.

The second loss function fuses the three-dimensional anchor point positions with the predicted normalized weights and compares the result with the GT value of each joint point:

loss_2 = α·Σ_{j∈J} L_{τ1}(Σ_{a∈A} W'_j(a)·P_a^{xy} − P_j^{xy*}) + Σ_{j∈J} L_{τ2}(Σ_{a∈A} W'_j(a)·P_a^{d} − P_j^{d*})

wherein τ_1 and τ_2 are likewise given as 1 and 3.

The total loss function is: loss = λ_1·loss_1 + λ_2·loss_2, wherein λ_1 and λ_2 are given hyper-parameters, set to 3 and 1 respectively, to balance the two loss functions.
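A sketch of these loss terms is given below. It is an assumed PyTorch implementation written for this description: only the constants α = 0.5, τ1 = 1, τ2 = 3, λ1 = 3 and λ2 = 1 are taken from the text, while the tensor shapes and the reduction over joint points are assumptions.

```python
import torch

def smooth_l1(x, tau):
    """L_tau(.): quadratic inside |x| < tau, linear outside."""
    a = x.abs()
    return torch.where(a < tau, 0.5 * x * x / tau, a - 0.5 * tau).sum()

def total_loss(pred_xy, pred_d, gt_xy, gt_d, weights, anchor_xy, anchor_d,
               alpha=0.5, tau1=1.0, tau2=3.0, lam1=3.0, lam2=1.0):
    # first loss: fused joint prediction against the GT coordinates
    loss1 = alpha * smooth_l1(pred_xy - gt_xy, tau1) + smooth_l1(pred_d - gt_d, tau2)

    # second loss: joints reconstructed from anchor positions and normalized weights only,
    # supervising the weight assigned to each three-dimensional anchor point
    w = torch.softmax(weights, dim=0)                             # (A, J)
    surr_xy = (w.unsqueeze(-1) * anchor_xy.unsqueeze(1)).sum(0)   # (J, 2)
    surr_d = (w * anchor_d.unsqueeze(1)).sum(0)                   # (J,)
    loss2 = alpha * smooth_l1(surr_xy - gt_xy, tau1) + smooth_l1(surr_d - gt_d, tau2)

    return lam1 * loss1 + lam2 * loss2
```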
According to the method, the first loss function is calculated from the predicted values and the GT values, improving the prediction accuracy; the second loss function is calculated from the fusion result of the three-dimensional anchor point positions and the predicted weights together with the GT value of each joint point, improving the generalization of the model. The network is trained according to the loss function. In the test stage, the network directly predicts the coordinate positions of the hand in three-dimensional space from the input RGB picture of the cut hand region, without any model prior, so the method has accurate prediction results and a high running speed.
The test results shown in fig. 7 and fig. 8 are obtained with the network trained in the training stage. At test time, only an RGB image needs to be input to obtain the three-dimensional hand gesture estimation result of the target object.
Example 9
The invention provides a three-dimensional hand gesture estimation method, which comprises the following steps: inputting the image to be identified into the trained three-dimensional hand gesture model to obtain a three-dimensional hand gesture estimation result of the target object; the image to be identified carries hand interaction information of the target object.
Example 10
The present invention provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method.
It will be readily appreciated by those skilled in the art that the foregoing is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (10)

1. The method for constructing the three-dimensional hand gesture model is characterized by comprising the following steps of:
s1: cutting the RGB image carrying the interactive hand information to obtain a first hand image;
s2: setting a plurality of uniformly distributed three-dimensional anchor points on the first hand image to obtain a second hand image, and obtaining three-dimensional coordinate information of each three-dimensional anchor point in space;
s3: inputting the second hand image into a feature enhancement model for feature extraction and feature enhancement to obtain the three-dimensional offset and weight from each three-dimensional anchor point to all corresponding estimated joint points;
s4: fusing the three-dimensional offset of each three-dimensional anchor point to the corresponding estimated joint point, the weight and the three-dimensional coordinate information of the three-dimensional anchor point to obtain the three-dimensional coordinate of the hand joint point corresponding to each three-dimensional anchor point;
s5: determining a loss function according to the three-dimensional coordinates of each hand joint point;
s6: and inputting the training sample into a three-dimensional hand gesture estimation network to be trained to train, and adjusting network parameters by using the loss function in the training process to finally obtain a trained three-dimensional hand gesture model.
2. The method for constructing a three-dimensional hand gesture model of claim 1, wherein S2 comprises:
uniformly dividing the first hand image into N image blocks, setting a three-dimensional anchor point at the center of each image block and acquiring the plane coordinates of each image block;
setting a depth value respectively in front of and behind the depth value of the center in a world coordinate system by taking the depth value of the root node coordinate of the hand in each image block as the center, and obtaining three depth coordinates of the three-dimensional anchor point;
and fusing the plane coordinates of the N three-dimensional anchor points and the corresponding three depth coordinates to obtain 3N three-dimensional coordinate information corresponding to the N three-dimensional anchor points.
3. The method for constructing a three-dimensional hand gesture model of claim 1, wherein S4 comprises:
the three-dimensional coordinates of the estimated joint point j corresponding to the three-dimensional anchor points are expressed as a plane coordinate S_j^{xy} and a depth coordinate position S_j^{d}:

S_j^{xy} = Σ_{a∈A} W'_j(a)·(P_a^{xy} + O_j^{xy}(a)), S_j^{d} = Σ_{a∈A} W'_j(a)·(P_a^{d} + O_j^{d}(a)),

wherein a∈A; P_a^{xy} and P_a^{d} respectively represent the plane coordinate and the depth coordinate position of the three-dimensional anchor point a; O_j^{xy}(a) and O_j^{d}(a) respectively represent the plane coordinate offset and the depth coordinate offset from the three-dimensional anchor point a to the corresponding estimated joint point j; W'_j(a), the normalized weight from the three-dimensional anchor point a to the corresponding estimated joint point j, represents its contribution value.
4. The method for constructing a three-dimensional hand gesture model of claim 3, wherein

W'_j(a) = exp(W_j(a)) / Σ_{b∈A} exp(W_j(b)),

wherein W_j(a) represents the weight from the three-dimensional anchor point a to the corresponding estimated joint point j.
5. A method of constructing a three-dimensional hand gesture model as claimed in claim 3 wherein the feature enhancement model comprises:
the multi-scale feature extractor is used for extracting features of the second hand image, selecting output features of the three-layer network structure as a multi-scale feature pyramid, fusing to obtain first image features, processing the first image features into the same plane size features by using a convolution layer and a group normalization layer, flattening and connecting the same to obtain second image features;
the feature enhancement module is used for inputting the spatial position codes of the second image features and the three-dimensional anchor points into a plurality of coding layers for enhancement to obtain third image features;
the three-dimensional anchor point feature interaction module is used for extracting three-dimensional anchor point features in the second hand image through a three-dimensional anchor point estimator to obtain a fourth image feature, and inputting the fourth image feature and the third image feature into a plurality of decoding layers for decoding so as to output a fifth image feature;
an offset-weight prediction module for feeding the fifth image feature into two multi-layer perceptrons (MLPs), one MLP predicting the coordinate offset O_j(a) and the other MLP predicting the weight W_j(a).
6. The method for constructing a three-dimensional hand gesture model of claim 5, wherein
each decoding layer comprises a self-attention structure and a cross-attention structure, and the weights of the MLPs in all decoding layers are shared;
for the self-attention structure, the input values are: Q = D + P_q, K = D + P_q, V = D, wherein D represents the fifth image feature and P_q is the spatial position code of the three-dimensional anchor points, P_q = MLP(PE(a_q));
for the cross-attention structure, the input values are: Q = D + P_q, K = a_q, V = E, wherein K indicates that the q three-dimensional anchor coordinate positions are selected as the reference points in the deformable attention model, and E represents the third image feature;
all the decoding layers extract the third image features corresponding to each three-dimensional anchor point and establish interaction between the three-dimensional anchor points.
7. The method of claim 1, wherein the loss function comprises a first loss function and a second loss function; the step S5 comprises the following steps:
generating a first loss function according to the three-dimensional coordinates and the true value information of each hand joint point, and supervising the estimation result of the obtained three-dimensional hand gesture as a first supervision signal;
and generating a second loss function according to the three-dimensional coordinates of each hand joint point and the weight corresponding to each three-dimensional anchor point, and taking the second loss function as a second supervision signal to supervise the distributed weight of the three-dimensional anchor point.
8. The method for constructing a three-dimensional hand gesture model of claim 7, wherein
the first loss function is:

loss_1 = α·Σ_{j∈J} L_{τ1}(S_j^{xy} − P_j^{xy*}) + Σ_{j∈J} L_{τ2}(S_j^{d} − P_j^{d*}),

wherein P_j^{xy*} and S_j^{xy} respectively represent the true value and the predicted value of the plane coordinate of the j-th joint point, P_j^{d*} and S_j^{d} respectively represent the true value and the predicted value of the depth coordinate of the j-th joint point, α represents a given weight, and L_τ(·) represents a smooth L1 loss;

the second loss function is:

loss_2 = α·Σ_{j∈J} L_{τ1}(Σ_{a∈A} W'_j(a)·P_a^{xy} − P_j^{xy*}) + Σ_{j∈J} L_{τ2}(Σ_{a∈A} W'_j(a)·P_a^{d} − P_j^{d*}),

wherein τ_1 and τ_2 are given as 1 and 3, respectively;

the loss function is: loss = λ_1·loss_1 + λ_2·loss_2, wherein λ_1 and λ_2 are given hyper-parameters.
9. A three-dimensional hand gesture estimation method, comprising:
inputting an image to be identified into the trained three-dimensional hand gesture model of any one of claims 1-8 to obtain a three-dimensional hand gesture estimation result of the target object;
the image to be identified carries hand interaction information of the target object.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of claim 9.
CN202310194731.3A 2023-02-28 2023-02-28 Construction method of three-dimensional hand gesture model and three-dimensional hand gesture estimation method Pending CN116188695A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310194731.3A CN116188695A (en) 2023-02-28 2023-02-28 Construction method of three-dimensional hand gesture model and three-dimensional hand gesture estimation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310194731.3A CN116188695A (en) 2023-02-28 2023-02-28 Construction method of three-dimensional hand gesture model and three-dimensional hand gesture estimation method

Publications (1)

Publication Number Publication Date
CN116188695A true CN116188695A (en) 2023-05-30

Family

ID=86432531

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310194731.3A Pending CN116188695A (en) 2023-02-28 2023-02-28 Construction method of three-dimensional hand gesture model and three-dimensional hand gesture estimation method

Country Status (1)

Country Link
CN (1) CN116188695A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116740290A (en) * 2023-08-15 2023-09-12 江西农业大学 Three-dimensional interaction double-hand reconstruction method and system based on deformable attention
CN116740290B (en) * 2023-08-15 2023-11-07 江西农业大学 Three-dimensional interaction double-hand reconstruction method and system based on deformable attention
CN116766213A (en) * 2023-08-24 2023-09-19 烟台大学 Bionic hand control method, system and equipment based on image processing
CN116766213B (en) * 2023-08-24 2023-11-03 烟台大学 Bionic hand control method, system and equipment based on image processing
CN117953545A (en) * 2024-03-27 2024-04-30 江汉大学 Three-dimensional hand gesture estimation method, device and processing equipment based on color image

Similar Documents

Publication Publication Date Title
WO2020156148A1 (en) Method for training smpl parameter prediction model, computer device, and storage medium
CN116188695A (en) Construction method of three-dimensional hand gesture model and three-dimensional hand gesture estimation method
CN108062536B (en) Detection method and device and computer storage medium
CN107045631B (en) Method, device and equipment for detecting human face characteristic points
WO2020107847A1 (en) Bone point-based fall detection method and fall detection device therefor
CN111968165B (en) Dynamic human body three-dimensional model complement method, device, equipment and medium
Wu et al. Handmap: Robust hand pose estimation via intermediate dense guidance map supervision
CN110598590A (en) Close interaction human body posture estimation method and device based on multi-view camera
US11113571B2 (en) Target object position prediction and motion tracking
CN108805151B (en) Image classification method based on depth similarity network
WO2021051526A1 (en) Multi-view 3d human pose estimation method and related apparatus
Su et al. Uncertainty guided multi-view stereo network for depth estimation
Wei et al. Bidirectional hybrid LSTM based recurrent neural network for multi-view stereo
CN109784295B (en) Video stream feature identification method, device, equipment and storage medium
CN117711066A (en) Three-dimensional human body posture estimation method, device, equipment and medium
Makris et al. Robust 3d human pose estimation guided by filtered subsets of body keypoints
CN117593762A (en) Human body posture estimation method, device and medium integrating vision and pressure
CN115546491B (en) Fall alarm method, system, electronic equipment and storage medium
CN115205737B (en) Motion real-time counting method and system based on transducer model
Faujdar et al. Human pose estimation using artificial intelligence with virtual gym tracker
CN110930482A (en) Human hand bone parameter determination method and device, electronic equipment and storage medium
CN115830707A (en) Multi-view human behavior identification method based on hypergraph learning
CN116824686A (en) Action recognition method and related device
CN115273219A (en) Yoga action evaluation method and system, storage medium and electronic equipment
Hua et al. Dual attention based multi-scale feature fusion network for indoor RGBD semantic segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination