CN116188695A - Construction method of three-dimensional hand gesture model and three-dimensional hand gesture estimation method - Google Patents

Construction method of three-dimensional hand gesture model and three-dimensional hand gesture estimation method

Info

Publication number
CN116188695A
Authority
CN
China
Prior art keywords
dimensional
image
hand
anchor point
hand gesture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310194731.3A
Other languages
Chinese (zh)
Inventor
肖阳
姜昌龙
吴存霖
郑璟泓
曹治国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN202310194731.3A priority Critical patent/CN116188695A/en
Publication of CN116188695A publication Critical patent/CN116188695A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Graphics (AREA)
  • Human Computer Interaction (AREA)
  • Geometry (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a construction method of a three-dimensional hand gesture model and a three-dimensional hand gesture estimation method, belonging to the technical field of three-dimensional gesture estimation and comprising the following steps: cutting an RGB image carrying interacting-hand information, setting uniformly distributed three-dimensional anchor points on the cut image, and obtaining the three-dimensional coordinate information of each three-dimensional anchor point; inputting the image with the three-dimensional anchor points into a feature enhancement model to obtain the three-dimensional offset and weight from each three-dimensional anchor point to all corresponding estimated joint points; fusing the three-dimensional offsets, their weights and the three-dimensional coordinate information to obtain the three-dimensional coordinates of the hand joint points, and then determining a loss function; and training a three-dimensional hand gesture estimation network with the loss function to obtain the target three-dimensional hand gesture model. By constructing interactivity among the local three-dimensional anchor points, the model gains global spatial and contextual perception, so that the joint information of the joint points to be estimated is better captured and the self-occlusion and mutual-occlusion problems of interacting hands are reduced.

Description

Construction method of three-dimensional hand gesture model and three-dimensional hand gesture estimation method
Technical Field
The invention belongs to the technical field of image posture estimation, and particularly relates to a construction method of a three-dimensional hand posture model and a three-dimensional hand posture estimation method.
Background
Many three-dimensional hand pose estimation methods are currently available. They can be broadly divided into the following two types.
The first type is three-dimensional hand pose estimation based on depth images. A common approach extracts features from the depth map and then obtains the coordinate position of each joint point by regression. In regression-based methods, a neural network predicts the three-dimensional hand information directly; encoder-decoder structures generate a probability density function, called a heat map, for each hand joint point, and the coordinate of the predicted joint point is taken from the position of highest probability in the heat map. Alternatively, a graph convolutional neural network extracts information from the image, a two-dimensional convolutional neural network produces three branches on the original depth map, namely the weight of each three-dimensional anchor point and its plane offset and depth offset to the other joint points, and the three-dimensional pose estimation result of the human body is obtained by weighted summation over the three-dimensional anchor points.
The second type is three-dimensional hand pose estimation based on RGB images. Predicting the three-dimensional joint positions of the hand from an RGB image is challenging because the RGB image lacks three-dimensional information. Network structures such as PoseNet segment the hand region from the image, perform pose estimation, and fit or map the result to three-dimensional space to infer its position. To address hand occlusion, multi-view methods have been used for hand pose estimation; they are effective for most view angles, but do not work well for a single hand when it is severely occluded. Although current methods can mitigate the occlusion and viewing-angle variations inherent to RGB images, they are limited to a single hand: multiple hand targets in an image, or a hand interacting with other objects, are difficult for them to identify.
However, the task of interacting-hand pose estimation from RGB images still faces many difficulties, which can be broadly divided into two aspects. First, in hand-interaction tasks, occlusion brings many challenges to the prediction of the interacting hands. Second, the corresponding joint-point features of the left and right hands are very similar, which makes it difficult to distinguish the features of each hand's joint points. Existing methods therefore struggle with poor prediction under occlusion, slow running speed, and an inability to effectively distinguish the hand joint points.
Disclosure of Invention
In view of the above defects or improvement needs of the prior art, the invention provides a construction method of a three-dimensional hand gesture model and a three-dimensional hand gesture estimation method. By constructing interactivity between local three-dimensional anchor points, the three-dimensional hand gesture model gains global spatial and contextual perception, so that the joint information of the joint points to be estimated is better captured and the self-occlusion and mutual occlusion that frequently occur between interacting hands are weakened, thereby solving the technical problem in the prior art that occlusion leads to low accuracy of three-dimensional hand gesture estimation.
In order to achieve the above object, according to one aspect of the present invention, there is provided a method for constructing a three-dimensional hand gesture model, including:
s1: cutting the RGB image carrying the interactive hand information to obtain a first hand image;
s2: setting a plurality of uniformly distributed three-dimensional anchor points on the first hand image to obtain a second hand image, and obtaining three-dimensional coordinate information of each three-dimensional anchor point in space;
s3: inputting the second hand image into a feature enhancement model for feature extraction and feature enhancement to obtain the three-dimensional offset and weight from each three-dimensional anchor point to all corresponding estimated joint points;
s4: fusing the three-dimensional offset of each three-dimensional anchor point to the corresponding estimated joint point, the weight and the three-dimensional coordinate information of the three-dimensional anchor point to obtain the three-dimensional coordinate of the hand joint point corresponding to each three-dimensional anchor point;
s5: determining a loss function according to the three-dimensional coordinates of each hand joint point;
s6: and inputting the training sample into a three-dimensional hand gesture estimation network to be trained to train, and adjusting network parameters by using the loss function in the training process to finally obtain a trained three-dimensional hand gesture model.
In one embodiment, the S2 includes:
uniformly dividing the first hand image into N image blocks, setting a three-dimensional anchor point at the center of each image block and acquiring the plane coordinates of each image block;
setting a depth value respectively in front of and behind the depth value of the center in a world coordinate system by taking the depth value of the root node coordinate of the hand in each image block as the center, and obtaining three depth coordinates of the three-dimensional anchor point;
and fusing the plane coordinates of the N three-dimensional anchor points and the corresponding three depth coordinates to obtain 3N three-dimensional coordinate information corresponding to the N three-dimensional anchor points.
In one embodiment, the S4 includes:
the three-dimensional coordinates of the estimated joint point j corresponding to the three-dimensional anchor points are expressed as a plane coordinate S_j^{xy} and a depth coordinate position S_j^{d}, obtained by weighted fusion over the anchor set A:

S_j^{xy} = Σ_{a∈A} W'_j(a)·(P_a^{xy} + O_j^{xy}(a))

S_j^{d} = Σ_{a∈A} W'_j(a)·(P_a^{d} + O_j^{d}(a))

wherein a∈A; P_a^{xy} and P_a^{d} respectively represent the plane coordinate and the depth coordinate position of the three-dimensional anchor point a; O_j^{xy}(a) and O_j^{d}(a) respectively represent the plane coordinate offset and the depth coordinate offset from the three-dimensional anchor point a to the corresponding estimated joint point j; W'_j(a), the normalized weight from the three-dimensional anchor point a to the corresponding estimated joint point j, represents its contribution value.
In one embodiment, the normalized weight is obtained by applying a Soft-max to the predicted weights:

W'_j(a) = exp(W_j(a)) / Σ_{b∈A} exp(W_j(b))

wherein W_j(a) represents the weight from the three-dimensional anchor point a to the corresponding estimated joint point j.
In one embodiment, the feature enhancement model includes:
the multi-scale feature extractor is used for extracting features of the second hand image through a ResNet-50 network structure, selecting the output features of a three-layer network structure as a multi-scale feature pyramid, fusing them to obtain the first image feature, processing the first image feature into features of the same plane size using a convolution layer and a group normalization layer, and flattening and concatenating these to obtain the second image feature;
the feature enhancement module is used for inputting the spatial position codes of the second image features and the three-dimensional anchor points into a plurality of coding layers for enhancement to obtain third image features;
the three-dimensional anchor point feature interaction module is used for extracting three-dimensional anchor point features in the second hand image through a three-dimensional anchor point estimator to obtain a fourth image feature, and inputting the fourth image feature and the third image feature into a plurality of decoding layers for decoding so as to output a fifth image feature;
an offset-weight prediction module for feeding the fifth image feature into two multi-layer perceptrons (MLPs), one MLP predicting the coordinate offset O_j(a) and the other MLP predicting the weight W_j(a).
In one embodiment, each decoding layer contains a self-attention structure and a cross-attention structure, and the weights of the MLPs in all decoding layers are shared;
for the self-attention structure, the input values are: Q = D + P_q, K = D + P_q, V = D, wherein D represents the fifth image feature and P_q is the spatial position code of the three-dimensional anchor points, denoted P_q = MLP(PE(a_q));
for the cross-attention structure, the input values are: Q = D + P_q, K = a_q, V = E, wherein K indicates that the q three-dimensional anchor coordinate positions are selected as the reference points in the deformable attention model, and E represents the third image feature;
all the decoding layers extract the third image features corresponding to each three-dimensional anchor point and establish interaction between the three-dimensional anchor points.
In one embodiment, the loss function includes a first loss function and a second loss function; the step S5 comprises the following steps:
generating a first loss function according to the three-dimensional coordinates and the true value information of each hand joint point, and supervising the estimation result of the obtained three-dimensional hand gesture as a first supervision signal;
and generating a second loss function according to the three-dimensional coordinates of each hand joint point and the weight corresponding to each three-dimensional anchor point, and taking the second loss function as a second supervision signal to supervise the distributed weight of the three-dimensional anchor point.
In one embodiment, the first loss function is:

loss_1 = α·Σ_{j∈J} L_{τ1}(S_j^{xy} − P_j^{xy*}) + Σ_{j∈J} L_{τ2}(S_j^{d} − P_j^{d*})

wherein P_j^{xy*} and S_j^{xy} respectively represent the true value and the predicted value of the plane coordinate of the j-th joint point, P_j^{d*} and S_j^{d} respectively represent the true value and the predicted value of the depth coordinate of the j-th joint point, α represents a given weight, and L_τ(·) represents a smooth L1 loss;

the second loss function is:

loss_2 = α·Σ_{j∈J} L_{τ1}(Σ_{a∈A} W'_j(a)·P_a^{xy} − P_j^{xy*}) + Σ_{j∈J} L_{τ2}(Σ_{a∈A} W'_j(a)·P_a^{d} − P_j^{d*})

wherein τ_1 and τ_2 are given as 1 and 3, respectively, to calculate a smoothed depth loss value;

the loss function is: loss = λ_1·loss_1 + λ_2·loss_2, wherein λ_1 and λ_2 are given hyper-parameters.
According to another aspect of the present invention, there is provided a three-dimensional hand gesture estimation method, including: inputting the image to be identified into the trained three-dimensional hand gesture model to obtain a three-dimensional hand gesture estimation result of the target object; the image to be identified carries hand interaction information of the target object.
According to another aspect of the invention, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the method.
In general, compared with the prior art, the above technical solutions conceived by the present invention can achieve the following beneficial effects:
(1) The invention provides a construction method of a three-dimensional hand gesture model. By constructing interactivity among local three-dimensional anchor points, the model gains global spatial and contextual perception, so that the joint information of the joint points to be estimated is better captured and the self-occlusion and mutual-occlusion problems of interacting hands are reduced. Meanwhile, because each three-dimensional anchor point estimates a three-dimensional offset to all the joint points to be estimated, the model can learn, through the weights, which three-dimensional anchor points contribute more to a given joint point and assign them larger weights. Finally, because the three-dimensional anchor points are placed in three-dimensional space, the depth values in hand gesture estimation are predicted more accurately.
(2) The method addresses the poor accuracy and slow running speed of existing hand gesture estimation methods, effectively handles the strong occlusion and strong feature similarity of interacting hand gestures, and, after training on large-scale data sets, has the beneficial effects of strong robustness and good generalization.
Drawings
Fig. 1 is a flow chart of a method for constructing a three-dimensional hand gesture model according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a method for constructing a three-dimensional hand gesture model according to an embodiment of the present invention.
Fig. 3 is an information flow diagram of a method provided by an embodiment of the invention during training and testing phases.
Fig. 4 is an arrangement of coding layers proposed in a method according to an embodiment of the present invention.
Fig. 5 is a setup of a decoding layer proposed in a method according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of the process of setting three-dimensional anchor points and predicting joint point coordinates through the three-dimensional anchor points according to a method provided by an embodiment of the present invention.
Fig. 7 is a schematic diagram of three-dimensional pose estimation results and three-dimensional anchor point weights of different depth layers under different interactive hand image inputs according to a method provided by an embodiment of the present invention.
Fig. 8 is a schematic diagram of the weights of three-dimensional anchor points of different joint points on different depth layers according to a method provided by an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Example 1
As shown in fig. 1, the present invention provides a method for constructing a three-dimensional hand gesture model, including:
s1: cutting the RGB image carrying the interactive hand information to obtain a first hand image;
s2: setting a plurality of uniformly distributed three-dimensional anchor points on the first hand image to obtain a second hand image, and obtaining three-dimensional coordinate information of each three-dimensional anchor point in space;
s3: inputting the second hand image into a feature enhancement model for feature extraction and feature enhancement to obtain three-dimensional offset and weight of each three-dimensional anchor point to all corresponding estimated joint points;
s4: fusing the three-dimensional offset of each three-dimensional anchor point to the corresponding estimated joint point, the weight and the three-dimensional coordinate information thereof to obtain the three-dimensional coordinate of the hand joint point corresponding to each three-dimensional anchor point;
s5: determining a loss function according to the three-dimensional coordinates of each hand joint point;
s6: and inputting the training sample into a three-dimensional hand gesture estimation network to be trained to train, and adjusting network parameters by using a loss function in the training process to finally obtain a trained three-dimensional hand gesture model.
Fig. 2 is a schematic diagram of a method for constructing a three-dimensional hand gesture model according to an embodiment of the present invention. The three-dimensional hand gesture estimation task is to predict the three-dimensional spatial position information of the hand joint points in a given depth image or RGB image; unlike single-hand gesture estimation, the gesture estimation task for interacting hands needs to predict the positions of the joint points of both hands in three-dimensional space at the same time. Aiming at the problems of similar hand joint features, severe inter-hand occlusion and self-occlusion existing in interacting-hand three-dimensional gesture estimation, the construction method provided by the invention adopts local three-dimensional anchor points: the details of the hand joint points are estimated accurately through the local three-dimensional anchor points, while an attention mechanism lets the model capture the relations among the global hand joint points in the image and acquire hand semantic information, thereby completing the three-dimensional gesture estimation of the interacting hands.
In this embodiment, unlike modeling methods that fit a hand parametric model, the algorithm can estimate the three-dimensional hand joint points with a neural network from a single RGB image, without using any prior from a hand parametric model. The information flow during the training phase and the testing phase is shown in fig. 3. For the three-dimensional hand gesture estimation task, each local three-dimensional anchor point acts as a local regressor: it regresses the three-dimensional offsets to all hand joint points, and the three-dimensional coordinate values of all hand joint points are then obtained by weighted fusion.
Example 2
S2, setting uniformly and densely distributed three-dimensional anchor points in an image, and obtaining three-dimensional coordinate information of the three-dimensional anchor points in a space, wherein the method specifically comprises the following steps:
for an input 256×256 image, a step length of 16 is selected, the image is uniformly divided into 256 image blocks, a three-dimensional anchor point is set at the center of each image block, and the plane coordinates of the 256 three-dimensional anchor points are generated. Taking the depth value of the hand root joint point coordinate as the center, depth values are set 10 cm in front of and 10 cm behind the central depth value, giving three depth values in total from which the depth coordinates of the three-dimensional anchor points are obtained. Fusing these coordinates yields 256×3 three-dimensional anchor points uniformly distributed over the image plane and over the planes in front of and behind it. The plane coordinate of each generated three-dimensional anchor point is expressed in the image coordinate system and its depth coordinate in the world coordinate system. The coordinate positions of the three-dimensional anchor points are saved when they are generated, so that they can be used later when predicting the joint point coordinates.
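The anchor layout of this example can be sketched as follows. This is a hypothetical NumPy illustration written for this description (the function name and the handling of metric units are assumptions, not part of the invention): a 256×256 crop with step length 16 gives a 16×16 grid of 256 block centres, and each centre is replicated on three depth planes at −10 cm, 0 and +10 cm around the hand root depth.

```python
import numpy as np

def build_3d_anchors(img_size=256, stride=16, root_depth=0.0, depth_offset=0.10):
    centres = np.arange(stride // 2, img_size, stride)            # 16 block-centre coordinates
    xs, ys = np.meshgrid(centres, centres)
    plane = np.stack([xs.ravel(), ys.ravel()], axis=1)            # (256, 2) plane coordinates
    depths = root_depth + np.array([-depth_offset, 0.0, depth_offset])  # three depth planes (m)
    anchors = np.concatenate(
        [np.hstack([plane, np.full((len(plane), 1), d)]) for d in depths], axis=0)
    return anchors                                                # 256 x 3 three-dimensional anchors

print(build_3d_anchors().shape)   # (768, 3)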
Example 3
For each three-dimensional anchor point a, its plane coordinate position and depth coordinate position are denoted P_a^{xy} and P_a^{d}. The plane coordinate offset and the depth coordinate offset from the three-dimensional anchor point to a predicted joint point j, obtained by the above method, are denoted O_j^{xy}(a) and O_j^{d}(a), and the normalized weight of the three-dimensional anchor point for the predicted joint point, representing its contribution value, is denoted W'_j(a). The plane coordinate position S_j^{xy} and the depth coordinate position S_j^{d} of the predicted joint point j are then obtained by weighted fusion:

S_j^{xy} = Σ_{a∈A} W'_j(a)·(P_a^{xy} + O_j^{xy}(a))

S_j^{d} = Σ_{a∈A} W'_j(a)·(P_a^{d} + O_j^{d}(a))
Example 4
The weight W_j(a) from the three-dimensional anchor point a to the joint point j is normalized and weighted using Soft-max, with the formula:

W'_j(a) = exp(W_j(a)) / Σ_{b∈A} exp(W_j(b))

wherein W'_j(a) represents the normalized weight of W_j(a).
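The weighting of Examples 3 and 4 can be illustrated with a short sketch. The snippet below is a minimal PyTorch-style illustration written for this description (tensor shapes, the function name and the use of torch are assumptions, not part of the invention): it applies the Soft-max normalization above and then forms the weighted sum of anchor position plus offset for every joint point.

```python
import torch

def fuse_anchor_predictions(anchor_xy, anchor_d, offset_xy, offset_d, weights):
    """Weighted anchor-to-joint fusion.
    anchor_xy: (A, 2)    plane coordinates P_a^{xy} of the A three-dimensional anchor points
    anchor_d:  (A,)      depth coordinates P_a^{d}
    offset_xy: (A, J, 2) plane offsets O_j^{xy}(a) to each of the J joint points
    offset_d:  (A, J)    depth offsets O_j^{d}(a)
    weights:   (A, J)    raw weights W_j(a)
    """
    w = torch.softmax(weights, dim=0)                            # normalize over anchors -> W'_j(a)
    joint_xy = (w.unsqueeze(-1) * (anchor_xy.unsqueeze(1) + offset_xy)).sum(dim=0)  # (J, 2)
    joint_d = (w * (anchor_d.unsqueeze(1) + offset_d)).sum(dim=0)                   # (J,)
    return joint_xy, joint_d
```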
Example 5
S3 comprises the following steps:
a1, inputting the RGB image into a characteristic enhancement model, and extracting and enhancing the characteristics of the image.
A2, obtaining the weight and the offset of the three-dimensional anchor point.
And A3, fusing the coordinate positions of the three-dimensional anchor points to obtain the three-dimensional coordinates of the hand joint points.
As shown in fig. 4 and fig. 5, in A1 multi-scale feature extraction is first performed on the cut RGB image (the second hand image): image features are extracted through a ResNet-50 network structure, the output features of a three-layer network structure are selected as a multi-scale feature pyramid, and the multi-layer features are fused to obtain the input features. The Encoder structure and the Decoder structure are then used to further extract and enhance the image features. In A2, the predicted weights and offsets are obtained through two multi-layer perceptron (Multilayer Perceptron, MLP) structures. As shown in fig. 6, in A3 the three-dimensional coordinates of the hand joint points are obtained by fusing the position coordinates of the three-dimensional anchor points.
Specifically, the input fused feature pyramid (the first image feature) is processed into features of the same plane size using a convolution layer and a Group-Normalization layer. These are then flattened and concatenated to obtain the second image feature, and the generated second image feature is added to the position code to obtain the third image feature. The formula of the position code is: P_xy = PE(x, y), where the PE operation encodes the position information into the feature by a sinusoidal transformation, and x, y represent the coordinate position in the feature map.
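The projection, flattening and position coding described above can be sketched as follows. This is an assumed PyTorch reading of the step, not the patent's code: "same plane size" is simplified here to a common channel width before flattening, and the sinusoidal PE(x, y) is one conventional implementation of the position code.

```python
import torch
import torch.nn as nn

def sine_pe(h, w, dim=256, temperature=10000.0):
    """2D sinusoidal position code P_xy = PE(x, y); half the channels encode y, half encode x."""
    y, x = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    freqs = temperature ** (torch.arange(dim // 4).float() * 4.0 / dim)
    def enc(c):
        t = c.flatten().float()[:, None] / freqs[None, :]
        return torch.cat([t.sin(), t.cos()], dim=1)               # (h*w, dim/2)
    return torch.cat([enc(y), enc(x)], dim=1)                     # (h*w, dim)

class FlattenWithPE(nn.Module):
    def __init__(self, in_channels=(512, 1024, 2048), dim=256):
        super().__init__()
        self.proj = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(c, dim, 1), nn.GroupNorm(32, dim)) for c in in_channels])

    def forward(self, pyramid):                                   # list of (B, C_i, H_i, W_i) maps
        tokens = []
        for proj, feat in zip(self.proj, pyramid):
            x = proj(feat)                                        # convolution + group normalization
            pe = sine_pe(x.shape[2], x.shape[3], x.shape[1]).to(x)
            tokens.append(x.flatten(2).transpose(1, 2) + pe)      # flatten, then add position code
        return torch.cat(tokens, dim=1)                           # concatenated token sequence
```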
The third image feature is input into a multi-scale Deformable Attention model, whose inputs are:

Q = F + P_xy, K = ref(F), V = F,

where F represents the input image feature and the ref operation represents the selected reference points.
Through this multi-scale Deformable Attention model, the image features are enhanced; a total of 6 encoder layers is used here.
For the enhanced image features, the three-dimensional anchor point features in the image are extracted through a three-dimensional anchor point estimator to obtain the fourth image feature, and feature fusion is carried out on the set three-dimensional anchor points.
The three-dimensional anchor point features in the image are extracted through the three-dimensional anchor point estimator, where an anchor is defined as a_q = (x_q, y_q, d_q); a_q represents the q-th three-dimensional anchor point, and x_q, y_q, d_q respectively represent its plane coordinate values and depth value. For each three-dimensional anchor point, its spatial position coding feature is expressed as: P_q = MLP(PE(a_q)).
The features output by the Decoder structure are input into two MLP structures: one MLP structure is used to predict the weight W_j(a), representing the weight value from the a-th three-dimensional anchor point to the j-th joint point; the other MLP structure is used to predict the coordinate offset O_j(a), representing the offset from the a-th three-dimensional anchor point to the j-th joint point. The offset O_j(a) is then split into O_j^{xy}(a) and O_j^{d}(a), which respectively represent the plane offset and the depth offset from the a-th three-dimensional anchor point to the j-th joint point.
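A minimal sketch of the two MLP heads in this example is given below (a PyTorch module written for this description; the hidden width and the joint count of 42, i.e. 21 joint points per hand, are assumptions): one head outputs W_j(a) for every joint point and the other outputs O_j(a), which is split into its plane and depth parts.

```python
import torch
import torch.nn as nn

class OffsetWeightHead(nn.Module):
    def __init__(self, feat_dim=256, num_joints=42, hidden=256):
        super().__init__()
        self.weight_mlp = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                        nn.Linear(hidden, num_joints))       # W_j(a)
        self.offset_mlp = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                        nn.Linear(hidden, num_joints * 3))   # O_j(a) = (dx, dy, dd)

    def forward(self, anchor_feat):                    # anchor_feat: (A, feat_dim) decoder output
        w = self.weight_mlp(anchor_feat)               # (A, J) weights to every joint point
        o = self.offset_mlp(anchor_feat).view(anchor_feat.size(0), -1, 3)
        return w, o[..., :2], o[..., 2]                # weights, plane offsets, depth offsets
```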
Example 6
Here MLP denotes a multi-layer perceptron; in the 6-layer decoding structure, each decoder layer contains such an MLP, and the MLP weights are shared across all 6 decoder layers.
Each decoder layer contains a self-attention structure and a cross-attention structure. For the self-attention structure, the values input into the multi-scale Deformable Attention model are: Q = D + P_q, K = D + P_q, V = D, where D represents the decoder output feature. For the cross-attention structure, the values input into the multi-scale Deformable Attention model are: Q = D + P_q, K = a_q, V = E, where K indicates that the reference points in the Deformable Attention model are selected as the q three-dimensional anchor coordinate positions, and E represents the output of the encoder structure. Finally, through the 6-layer decoder structure, the model extracts the feature corresponding to each three-dimensional anchor point and establishes interaction between the three-dimensional anchor points, i.e., establishes the relation between the three-dimensional anchor point features.
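The data flow of one decoder layer can be sketched as below. This is a simplified stand-in written for this description: ordinary multi-head attention replaces the multi-scale Deformable Attention of the embodiment, so only the query/key/value arrangement Q = D + P_q, K = D + P_q, V = D for self-attention and the cross-attention of the anchor queries against the encoder output E is illustrated, not the reference-point sampling itself.

```python
import torch
import torch.nn as nn

class AnchorDecoderLayer(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, d, p_q, enc_feat):
        # self-attention over anchor queries: Q = K = D + P_q, V = D
        q = d + p_q
        d = self.norm1(d + self.self_attn(q, q, d, need_weights=False)[0])
        # cross-attention: queries D + P_q attend to the encoder output E
        d = self.norm2(d + self.cross_attn(d + p_q, enc_feat, enc_feat, need_weights=False)[0])
        return self.norm3(d + self.ffn(d))
```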
Example 7
In the training stage, a mainstream interacting-hand RGB data set and a given cropping frame are used to obtain the cut RGB image. The large-scale public data set used is the InterHand2.6M data set; the depth data sets NYU and HANDS2017 are used to verify the generalization of the model. The cut image containing the interacting-hand information is input as a training sample into the three-dimensional hand gesture estimation network to be trained, and a three-dimensional hand gesture estimation result is obtained. A first loss function is generated according to the generated coordinates of the three-dimensional hand joint points and the GT (ground-truth) information and used as a first supervision signal to supervise the obtained three-dimensional hand gesture estimation result; at the same time, a second loss function is generated according to the set three-dimensional anchor point coordinate positions and the weight predicted for each three-dimensional anchor point, and used as a second supervision signal to supervise the assigned three-dimensional anchor point weights.
In the test stage, the input interactive hand image is cut, the cut image is input into a three-dimensional posture estimation network based on a three-dimensional anchor point, and a three-dimensional hand posture estimation result can be obtained.
Example 8
Determining the first loss function, the second loss function and the total loss function according to the three-dimensional coordinates of the hand joint points includes the following.

The first loss function is:

loss_1 = α·Σ_{j∈J} L_{τ1}(S_j^{xy} − P_j^{xy*}) + Σ_{j∈J} L_{τ2}(S_j^{d} − P_j^{d*})

wherein P_j^{xy*} and S_j^{xy} respectively represent the GT value and the predicted value of the plane coordinate of the j-th joint point, and P_j^{d*} and S_j^{d} respectively represent the GT value and the predicted value of the depth coordinate of the j-th joint point. α represents a given weight parameter, with a given value of 0.5. L_τ(·) represents a smooth L1 loss, expressed as:

L_τ(x) = x²/(2τ), if |x| < τ; |x| − τ/2, otherwise,

wherein τ_1 = 1 and τ_2 = 3 are given, respectively, to calculate a smoothed depth loss value.

The second loss function fuses the three-dimensional anchor point positions with the predicted normalized weights and compares the result with the GT value of each joint point:

loss_2 = α·Σ_{j∈J} L_{τ1}(Σ_{a∈A} W'_j(a)·P_a^{xy} − P_j^{xy*}) + Σ_{j∈J} L_{τ2}(Σ_{a∈A} W'_j(a)·P_a^{d} − P_j^{d*})

wherein τ_1 and τ_2 are likewise given as 1 and 3.

The total loss function is: loss = λ_1·loss_1 + λ_2·loss_2, wherein λ_1 and λ_2 are given hyper-parameters, set to 3 and 1 respectively, to balance the two loss functions.
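A sketch of these loss terms is given below. It is an assumed PyTorch implementation written for this description: only the constants α = 0.5, τ1 = 1, τ2 = 3, λ1 = 3 and λ2 = 1 are taken from the text, while the tensor shapes and the reduction over joint points are assumptions.

```python
import torch

def smooth_l1(x, tau):
    """L_tau(.): quadratic inside |x| < tau, linear outside."""
    a = x.abs()
    return torch.where(a < tau, 0.5 * x * x / tau, a - 0.5 * tau).sum()

def total_loss(pred_xy, pred_d, gt_xy, gt_d, weights, anchor_xy, anchor_d,
               alpha=0.5, tau1=1.0, tau2=3.0, lam1=3.0, lam2=1.0):
    # first loss: fused joint prediction against the GT coordinates
    loss1 = alpha * smooth_l1(pred_xy - gt_xy, tau1) + smooth_l1(pred_d - gt_d, tau2)

    # second loss: joints reconstructed from anchor positions and normalized weights only,
    # supervising the weight assigned to each three-dimensional anchor point
    w = torch.softmax(weights, dim=0)                             # (A, J)
    surr_xy = (w.unsqueeze(-1) * anchor_xy.unsqueeze(1)).sum(0)   # (J, 2)
    surr_d = (w * anchor_d.unsqueeze(1)).sum(0)                   # (J,)
    loss2 = alpha * smooth_l1(surr_xy - gt_xy, tau1) + smooth_l1(surr_d - gt_d, tau2)

    return lam1 * loss1 + lam2 * loss2
```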
According to the method, the first loss function is calculated from the predicted values and the GT values, improving the prediction accuracy; the second loss function is calculated from the fusion result of the three-dimensional anchor point positions and the predicted weights together with the GT value of each joint point, improving the generalization of the model. The network is trained according to the loss function. In the test stage, the network directly predicts the coordinate positions of the hand in three-dimensional space from the input RGB picture of the cut hand region, without any model prior, so the method has accurate prediction results and a high running speed.
The test results shown in fig. 7 and fig. 8 are obtained with the network trained in the training stage. At test time, only an RGB image needs to be input to obtain the three-dimensional hand gesture estimation result of the target object.
Example 9
The invention provides a three-dimensional hand gesture estimation method, which comprises the following steps: inputting the image to be identified into the trained three-dimensional hand gesture model to obtain a three-dimensional hand gesture estimation result of the target object; the image to be identified carries hand interaction information of the target object.
Example 10
The present invention provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method.
It will be readily appreciated by those skilled in the art that the foregoing is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (10)

1. The method for constructing the three-dimensional hand gesture model is characterized by comprising the following steps of:
s1: cutting the RGB image carrying the interactive hand information to obtain a first hand image;
s2: setting a plurality of uniformly distributed three-dimensional anchor points on the first hand image to obtain a second hand image, and obtaining three-dimensional coordinate information of each three-dimensional anchor point in space;
s3: inputting the second hand image into a feature enhancement model for feature extraction and feature enhancement to obtain the three-dimensional offset and weight from each three-dimensional anchor point to all corresponding estimated joint points;
s4: fusing the three-dimensional offset of each three-dimensional anchor point to the corresponding estimated joint point, the weight and the three-dimensional coordinate information of the three-dimensional anchor point to obtain the three-dimensional coordinate of the hand joint point corresponding to each three-dimensional anchor point;
s5: determining a loss function according to the three-dimensional coordinates of each hand joint point;
s6: and inputting the training sample into a three-dimensional hand gesture estimation network to be trained to train, and adjusting network parameters by using the loss function in the training process to finally obtain a trained three-dimensional hand gesture model.
2. The method for constructing a three-dimensional hand gesture model of claim 1, wherein S2 comprises:
uniformly dividing the first hand image into N image blocks, setting a three-dimensional anchor point at the center of each image block and acquiring the plane coordinates of each image block;
setting a depth value respectively in front of and behind the depth value of the center in a world coordinate system by taking the depth value of the root node coordinate of the hand in each image block as the center, and obtaining three depth coordinates of the three-dimensional anchor point;
and fusing the plane coordinates of the N three-dimensional anchor points and the corresponding three depth coordinates to obtain 3N three-dimensional coordinate information corresponding to the N three-dimensional anchor points.
3. The method for constructing a three-dimensional hand gesture model of claim 1, wherein S4 comprises:
the three-dimensional coordinates of the estimated joint point j corresponding to the three-dimensional anchor points are expressed as a plane coordinate S_j^{xy} and a depth coordinate position S_j^{d}:

S_j^{xy} = Σ_{a∈A} W'_j(a)·(P_a^{xy} + O_j^{xy}(a)), S_j^{d} = Σ_{a∈A} W'_j(a)·(P_a^{d} + O_j^{d}(a)),

wherein a∈A; P_a^{xy} and P_a^{d} respectively represent the plane coordinate and the depth coordinate position of the three-dimensional anchor point a; O_j^{xy}(a) and O_j^{d}(a) respectively represent the plane coordinate offset and the depth coordinate offset from the three-dimensional anchor point a to the corresponding estimated joint point j; W'_j(a), the normalized weight from the three-dimensional anchor point a to the corresponding estimated joint point j, represents its contribution value.
4. The method for constructing a three-dimensional hand gesture model of claim 3, wherein

W'_j(a) = exp(W_j(a)) / Σ_{b∈A} exp(W_j(b)),

wherein W_j(a) represents the weight from the three-dimensional anchor point a to the corresponding estimated joint point j.
5. A method of constructing a three-dimensional hand gesture model as claimed in claim 3 wherein the feature enhancement model comprises:
the multi-scale feature extractor is used for extracting features of the second hand image, selecting output features of the three-layer network structure as a multi-scale feature pyramid, fusing to obtain first image features, processing the first image features into the same plane size features by using a convolution layer and a group normalization layer, flattening and connecting the same to obtain second image features;
the feature enhancement module is used for inputting the spatial position codes of the second image features and the three-dimensional anchor points into a plurality of coding layers for enhancement to obtain third image features;
the three-dimensional anchor point feature interaction module is used for extracting three-dimensional anchor point features in the second hand image through a three-dimensional anchor point estimator to obtain a fourth image feature, and inputting the fourth image feature and the third image feature into a plurality of decoding layers for decoding so as to output a fifth image feature;
an offset-weight prediction module for feeding the fifth image feature into two multi-layer perceptrons (MLPs), one MLP predicting the coordinate offset O_j(a) and the other MLP predicting the weight W_j(a).
6. The method for constructing a three-dimensional hand gesture model of claim 5, wherein
each decoding layer comprises a self-attention structure and a cross-attention structure, and the weights of the MLPs in all decoding layers are shared;
for the self-attention structure, the input values are: Q = D + P_q, K = D + P_q, V = D, wherein D represents the fifth image feature and P_q is the spatial position code of the three-dimensional anchor points, P_q = MLP(PE(a_q));
for the cross-attention structure, the input values are: Q = D + P_q, K = a_q, V = E, wherein K indicates that the q three-dimensional anchor coordinate positions are selected as the reference points in the deformable attention model, and E represents the third image feature;
all the decoding layers extract the third image features corresponding to each three-dimensional anchor point and establish interaction between the three-dimensional anchor points.
7. The method of claim 1, wherein the loss function comprises a first loss function and a second loss function; the step S5 comprises the following steps:
generating a first loss function according to the three-dimensional coordinates and the true value information of each hand joint point, and supervising the estimation result of the obtained three-dimensional hand gesture as a first supervision signal;
and generating a second loss function according to the three-dimensional coordinates of each hand joint point and the weight corresponding to each three-dimensional anchor point, and taking the second loss function as a second supervision signal to supervise the distributed weight of the three-dimensional anchor point.
8. The method for constructing a three-dimensional hand gesture model of claim 7, wherein
the first loss function is:

loss_1 = α·Σ_{j∈J} L_{τ1}(S_j^{xy} − P_j^{xy*}) + Σ_{j∈J} L_{τ2}(S_j^{d} − P_j^{d*}),

wherein P_j^{xy*} and S_j^{xy} respectively represent the true value and the predicted value of the plane coordinate of the j-th joint point, P_j^{d*} and S_j^{d} respectively represent the true value and the predicted value of the depth coordinate of the j-th joint point, α represents a given weight, and L_τ(·) represents a smooth L1 loss;

the second loss function is:

loss_2 = α·Σ_{j∈J} L_{τ1}(Σ_{a∈A} W'_j(a)·P_a^{xy} − P_j^{xy*}) + Σ_{j∈J} L_{τ2}(Σ_{a∈A} W'_j(a)·P_a^{d} − P_j^{d*}),

wherein τ_1 and τ_2 are given as 1 and 3, respectively;

the loss function is: loss = λ_1·loss_1 + λ_2·loss_2, wherein λ_1 and λ_2 are given hyper-parameters.
9. A three-dimensional hand gesture estimation method, comprising:
inputting an image to be identified into the trained three-dimensional hand gesture model of any one of claims 1-8 to obtain a three-dimensional hand gesture estimation result of the target object;
the image to be identified carries hand interaction information of the target object.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of claim 9.
CN202310194731.3A 2023-02-28 2023-02-28 Construction method of three-dimensional hand gesture model and three-dimensional hand gesture estimation method Pending CN116188695A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310194731.3A CN116188695A (en) 2023-02-28 2023-02-28 Construction method of three-dimensional hand gesture model and three-dimensional hand gesture estimation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310194731.3A CN116188695A (en) 2023-02-28 2023-02-28 Construction method of three-dimensional hand gesture model and three-dimensional hand gesture estimation method

Publications (1)

Publication Number Publication Date
CN116188695A true CN116188695A (en) 2023-05-30

Family

ID=86432531

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310194731.3A Pending CN116188695A (en) 2023-02-28 2023-02-28 Construction method of three-dimensional hand gesture model and three-dimensional hand gesture estimation method

Country Status (1)

Country Link
CN (1) CN116188695A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116740290A (en) * 2023-08-15 2023-09-12 江西农业大学 Three-dimensional interaction double-hand reconstruction method and system based on deformable attention
CN116740290B (en) * 2023-08-15 2023-11-07 江西农业大学 Three-dimensional interaction double-hand reconstruction method and system based on deformable attention
CN116766213A (en) * 2023-08-24 2023-09-19 烟台大学 Bionic hand control method, system and equipment based on image processing
CN116766213B (en) * 2023-08-24 2023-11-03 烟台大学 Bionic hand control method, system and equipment based on image processing
CN117953545A (en) * 2024-03-27 2024-04-30 江汉大学 Three-dimensional hand gesture estimation method, device and processing equipment based on color image

Similar Documents

Publication Publication Date Title
WO2020156148A1 (en) Method for training smpl parameter prediction model, computer device, and storage medium
CN116188695A (en) Construction method of three-dimensional hand gesture model and three-dimensional hand gesture estimation method
CN108062536B (en) Detection method and device and computer storage medium
CN107045631B (en) Method, device and equipment for detecting human face characteristic points
WO2020107847A1 (en) Bone point-based fall detection method and fall detection device therefor
CN111968165B (en) Dynamic human body three-dimensional model complement method, device, equipment and medium
Wu et al. Handmap: Robust hand pose estimation via intermediate dense guidance map supervision
CN110598590A (en) Close interaction human body posture estimation method and device based on multi-view camera
US11113571B2 (en) Target object position prediction and motion tracking
CN108805151B (en) Image classification method based on depth similarity network
WO2021051526A1 (en) Multi-view 3d human pose estimation method and related apparatus
Su et al. Uncertainty guided multi-view stereo network for depth estimation
Wei et al. Bidirectional hybrid LSTM based recurrent neural network for multi-view stereo
CN109784295B (en) Video stream feature identification method, device, equipment and storage medium
CN117711066A (en) Three-dimensional human body posture estimation method, device, equipment and medium
Makris et al. Robust 3d human pose estimation guided by filtered subsets of body keypoints
CN117593762A (en) Human body posture estimation method, device and medium integrating vision and pressure
CN115546491B (en) Fall alarm method, system, electronic equipment and storage medium
CN115205737B (en) Motion real-time counting method and system based on transducer model
Faujdar et al. Human pose estimation using artificial intelligence with virtual gym tracker
CN110930482A (en) Human hand bone parameter determination method and device, electronic equipment and storage medium
CN115830707A (en) Multi-view human behavior identification method based on hypergraph learning
CN116824686A (en) Action recognition method and related device
CN115273219A (en) Yoga action evaluation method and system, storage medium and electronic equipment
Hua et al. Dual attention based multi-scale feature fusion network for indoor RGBD semantic segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination