CN110348359B - Hand gesture tracking method, device and system - Google Patents

Hand gesture tracking method, device and system

Info

Publication number
CN110348359B
CN110348359B
Authority
CN
China
Prior art keywords
hand
layer
posture
target
gesture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910599290.9A
Other languages
Chinese (zh)
Other versions
CN110348359A (en)
Inventor
齐越
车云龙
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
Application filed by Beihang University
Priority to CN201910599290.9A
Publication of CN110348359A
Application granted
Publication of CN110348359B
Legal status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/757Matching configurations of points or features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107Static hand or arm

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention provides a method, a device and a system for tracking hand gestures. The method comprises the steps of obtaining a depth image containing a hand area, and acquiring a target hand posture through a target network model according to the depth image. The target network model is a learning model which is trained in advance and used for performing feature extraction and feature recognition on the depth image to obtain a target initial hand posture, and optimizing the target initial hand posture to output the target hand posture, thereby improving the accuracy and the efficiency of posture tracking. Real-time hand gesture tracking can be achieved without high-end computing resources such as a GPU (Graphics Processing Unit).

Description

Hand gesture tracking method, device and system
Technical Field
The invention relates to the technical field of computer vision, in particular to a method, a device and a system for tracking hand gestures.
Background
With the popularization of depth sensors and the demands of the human-computer interaction field, hand gesture recognition and tracking based on depth data has become one of the indispensable techniques in novel human-computer interaction, with broad application prospects such as gesture remote control of unmanned aerial vehicles, gesture control of household robots, motion-sensing games and assisted medical operations. In contrast to conventional RGB-image-based hand pose estimation, depth data can provide three-dimensional distance information of the hand.
Existing hand gesture recognition in human-computer interaction requires the user's hands to be parallel to the imaging plane of the camera; however, people's hands often form an angle with the horizontal plane, so the accuracy of hand gesture recognition and tracking is not high.
Moreover, most existing hand tracking methods use a fixed hand model. Different users have hands of different sizes, so using a fixed template reduces tracking precision; hand shape calibration sometimes needs to be performed in advance for different users, which makes operation complex, lowers gesture tracking efficiency and degrades the user experience.
Disclosure of Invention
The invention provides a method, a device and a system for tracking hand gestures, which improve the accuracy and efficiency of gesture tracking and can realize real-time gesture tracking without high-end computing resources (such as a GPU (Graphics Processing Unit)).
In a first aspect, a method for tracking hand gestures provided by an embodiment of the present invention includes:
acquiring a depth image containing a hand region;
acquiring a target hand gesture through a target network model according to the depth image; the target network model is a learning model which is trained in advance and used for performing feature extraction and feature recognition on the depth image to obtain a target initial hand posture, and optimizing the target initial hand posture to output the target hand posture.
In one possible design, acquiring a depth image containing a hand region includes: a depth camera is used to capture a depth image containing a hand region.
In one possible design, before obtaining the target hand pose through the target network model according to the depth image, the method further includes:
constructing a posture initialization network model, wherein the posture initialization network model comprises a hand global positioning branch and a hand posture classification branch; the hand global positioning branch is used for extracting feature points of the hand in the depth image and outputting a hand global posture according to the feature points; the hand posture classification branch is used for extracting feature points of the depth image, and performing matching classification according to the feature points of a preset hand reference posture and the feature points of the depth image to obtain the current local hand posture; and training the posture optimization module through the training data set to obtain the target network model.
In one possible design, the posture optimization module is specifically configured to match and fuse the hand global posture and the current local hand posture to obtain the target initial hand posture, and to optimize the target initial hand posture with a target equation to obtain the target hand posture.
In one possible design, the posture optimization module comprises a target equation composed of a plurality of optimization function terms. These optimization function terms are constructed through preset constraint conditions, and include function terms corresponding to hand posture optimization and function terms corresponding to hand shape optimization.
In a second aspect, an embodiment of the present invention provides an apparatus for tracking hand gestures, applying the method according to any one of the first aspect, wherein the target network model includes a hand global positioning branch and a hand posture classification branch, each composed of a plurality of convolutional layers, normalization layers, relu activation layers, pooling layers, an unpooling layer, a heatmap layer, fully-connected layers and a softmax layer.
In one possible design, the hand global positioning branch comprises a plurality of convolutional layers, normalization layers, relu activation layers, a pooling layer and an unpooling layer. The convolutional layers are used for extracting feature points of the hand in the depth image; the normalization layer is used for restricting the feature points to a numerical range; the relu activation layer is used for outputting the feature points as a feature map with enhanced expression; the pooling layer compresses the enhanced feature map to reduce its size; and the unpooling layer scales the compressed feature map back up, associated with the maximum-likelihood distance density map output of the joint points. The heatmap layer is used for generating an initial global hand position.
In one possible design, the hand posture classification branch includes a plurality of sequentially combined convolutional layers, normalization layers and relu activation layers, a pooling layer, fully-connected layers with relu activation, and a softmax layer. The convolutional layers are used for extracting feature points of the depth image; the normalization layer limits the numerical range of the feature points; the relu activation layer outputs the feature points as feature maps with enhanced expression; the pooling layer compresses the enhanced feature maps to reduce their size; the fully-connected layers connect the compressed feature maps to obtain a local feature map; and the softmax layer classifies the local feature map by outputting hand-posture probabilities to obtain the current local hand posture.
In a third aspect, a system for tracking hand gestures provided by an embodiment of the present invention includes: the device comprises a memory and a processor, wherein the memory stores executable instructions of the processor; wherein the processor is configured to perform the method of hand gesture tracking of any of the first aspect via execution of the executable instructions.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for hand gesture tracking according to any one of the first aspect.
The invention provides a method, a device and a system for tracking hand gestures, wherein the method comprises the steps of obtaining a depth image containing a hand area; acquiring a target hand gesture through a target network model according to the depth image; the target network model is a learning model which is trained in advance and used for carrying out feature extraction and feature recognition on the depth image so as to output the target hand gesture. The accuracy and the efficiency of the posture tracking are improved, and the real-time posture tracking can be realized without high computing equipment resources (such as a GPU (Graphics Processing Unit)). The target network model can realize the processing processes of hand posture optimization, hand shape optimization, hand posture shape joint optimization and the like, and can meet the requirements of accuracy and real-time performance of posture tracking.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flowchart of a hand gesture tracking method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a hand gesture tracking method according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of a network model in the hand gesture tracking method according to the second embodiment of the present invention;
FIG. 4 is a flowchart of a hand gesture tracking method according to a third embodiment of the present invention;
fig. 5 is a schematic diagram illustrating a part of effects in a hand gesture tracking method according to a third embodiment of the present invention;
fig. 6 is a schematic structural diagram of a hand gesture tracking system according to a fourth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The following describes the technical solutions of the present invention and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.
In the prior art, hand gesture tracking is mostly realized with a fixed hand model, which reduces the accuracy of hand gesture tracking and degrades the user experience.
Fig. 1 is a flowchart of a hand gesture tracking method according to an embodiment of the present invention, and as shown in fig. 1, the method in this embodiment may include:
s101, acquiring a depth image containing a hand area.
Specifically, a depth camera is adopted to shoot a depth image containing a hand area.
A depth image is an image or image channel containing information about the distance from a viewpoint to the surfaces of scene objects. Each pixel value of a depth image encodes the actual distance between the sensor and the object; its pixels correspond one-to-one with those of the color image, and the number of bits used to store each pixel determines the resolution of the depth measurement. Methods of acquiring a depth image include passive range sensing and active depth sensing. The most common passive method is binocular stereo vision: two cameras a certain distance apart simultaneously capture two images of the same scene, a stereo matching algorithm finds corresponding pixel points in the two images, disparity information is computed according to the triangulation principle, and the disparity can be converted to represent the depth of objects in the scene. Compared with passive range sensing, the most distinctive feature of active range sensing is that the device itself must emit energy to complete the acquisition of depth information; this also makes depth acquisition independent of color-image acquisition. Active depth sensing methods mainly include TOF (Time of Flight), structured light and laser scanning.
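As a minimal illustration of the binocular triangulation just described, the depth of a matched point can be recovered from its disparity; the focal length, baseline and disparity below are assumed example values, not parameters from the patent:

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Triangulation for a rectified stereo pair: Z = f * B / d,
    where f is the focal length in pixels, B the baseline in meters,
    and d the disparity in pixels."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# Assumed example: 525 px focal length, 6 cm baseline, 42 px disparity
depth = disparity_to_depth(42.0, 525.0, 0.06)  # 0.75 m
```

Larger disparities correspond to closer points, which is why nearby hands are well suited to stereo depth sensing.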
In this embodiment, the depth image contains the three-dimensional representation of the hand region, and the gray value of each pixel can represent the distance from a point of the hand region to the camera in the scene; it is typically captured by a stereo camera or a TOF camera. Given fixed camera intrinsic parameters, the depth image can be converted into a point cloud. The TOF camera acquires a depth image as follows: it emits continuous near-infrared pulses toward the target scene and receives, with a sensor, the light pulses reflected back by the object. By comparing the phase difference between the emitted light pulse and the reflected light pulse, the transmission delay between the pulses can be calculated, the distance between the object and the emitter obtained, and finally the depth image produced.
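The conversion from a depth image to a point cloud under fixed intrinsics can be sketched with the standard pinhole back-projection; the intrinsic parameters and the tiny depth map are illustrative assumptions:

```python
import numpy as np

def depth_to_point_cloud(depth_m, fx, fy, cx, cy):
    """Back-project a depth image (meters) into a 3D point cloud with the
    pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]                        # drop zero-depth pixels

# Tiny 2x2 depth map with one invalid pixel, assumed intrinsics
d = np.array([[0.5, 0.5],
              [0.0, 0.5]])
cloud = depth_to_point_cloud(d, fx=500.0, fy=500.0, cx=1.0, cy=1.0)
```

Invalid (zero-depth) pixels are discarded, so the resulting cloud contains only the measured surface points.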
S102, acquiring a target hand gesture through a target network model according to the depth image; the target network model is a learning model which is trained in advance and used for performing feature extraction and feature recognition on the depth image to obtain a target initial hand posture, and optimizing the target initial hand posture to output the target hand posture.
In this embodiment, the target network model differs from a classical CNN: it includes a plurality of sequentially combined convolutional layers, normalization layers and relu activation layers, at least one pooling layer, an unpooling layer, and a plurality of sequentially combined fully-connected layers and relu activation layers. It is a learning model that performs feature extraction and feature recognition on the depth image containing the hand region to obtain a target initial hand posture, and optimizes the target initial hand posture to output the target hand posture.
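As a rough NumPy sketch of the layer primitives named above (relu activation, normalization, 2x2 max pooling, softmax); this is not the patent's actual network, and the 4x4 feature map is invented for illustration:

```python
import numpy as np

def relu(x):
    # relu activation: pass positives through, zero out negatives
    return np.maximum(x, 0.0)

def normalize(x, eps=1e-5):
    # limit feature values to a stable numeric range (zero mean, unit variance)
    return (x - x.mean()) / (x.std() + eps)

def max_pool_2x2(fmap):
    # compress an (H, W) feature map: max over each non-overlapping 2x2 block
    h, w = fmap.shape
    return fmap[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def softmax(logits):
    # turn scores into hand-posture class probabilities summing to 1
    e = np.exp(logits - logits.max())
    return e / e.sum()

fmap = normalize(np.arange(16, dtype=float).reshape(4, 4))  # invented feature map
pooled = max_pool_2x2(relu(fmap))                           # (2, 2) compressed map
probs = softmax(pooled.ravel())                             # class probabilities
```

An unpooling layer would do the reverse of `max_pool_2x2`, scaling a compressed map back up before the heatmap output.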
This embodiment obtains a depth image containing a hand region, uses the target network model to perform feature extraction and feature recognition on it to obtain a target initial hand posture, and optimizes that posture to output the target hand posture. This improves the accuracy and efficiency of gesture tracking, and real-time hand gesture tracking can be realized without high-end computing resources (such as a GPU (Graphics Processing Unit)).
Based on the above embodiment, referring to fig. 2, fig. 2 is a flowchart of a hand gesture tracking method according to a second embodiment of the present invention, in this embodiment, before obtaining a target hand gesture through a target network model according to a depth image, the method further includes step S201 and step S202, that is, the method in this embodiment includes:
s201, constructing a posture initialization network model, wherein the posture initialization network model comprises a hand global positioning branch and a hand posture classification branch; the hand global positioning branch is used for extracting feature points of the hand in the depth image and outputting a hand global posture (namely a global coordinate and an orientation of the hand) according to the feature points; the hand gesture classification branch is used for extracting feature points of the depth image, and performing matching classification according to the feature points of the preset hand reference gesture and the feature points of the depth image to obtain the current local gesture of the hand (namely the rotation angle of each joint point of the hand). The hand global pose is 3D coordinates and 3 rotational degrees of freedom of the whole hand in space, and the hand shape is included in the hand global pose. The current local hand pose refers to the rotation angles of the various joint points of the hand.
Specifically referring to fig. 3, fig. 3 is a schematic structural diagram of the network model in the hand gesture tracking method according to the second embodiment of the present invention. As shown in fig. 3, 1 is a convolutional layer with normalization layer and relu activation layer, 2 is a pooling layer, 3 is an unpooling layer, 4 is a heatmap (thermal) layer, 5 is a fully-connected layer with relu activation layer, and 6 is a softmax layer. A posture initialization network model is constructed, comprising a hand global positioning branch, a hand posture classification branch and a posture optimization module. The hand global positioning branch may sequentially comprise a plurality of combined convolutional layers, normalization layers and relu activation layers, a pooling layer and an unpooling layer. The convolutional layers are used for extracting feature points of the hand in the depth image; the normalization layer limits the feature-point values to a certain range to eliminate the adverse effect of singular feature points; the relu activation layer outputs the feature points as a feature map with enhanced expression; the pooling layer compresses the feature map to reduce its size; and the unpooling layer scales the compressed feature map back up, associated with the maximum-likelihood distance density map output of the joint points. Finally, an initial global hand position is generated through the heatmap layer.
The hand posture classification branch may sequentially comprise a plurality of combined convolutional layers, normalization layers and relu activation layers, a pooling layer, a fully-connected layer with relu activation, and a softmax layer. The convolutional layers are used for extracting feature points of the depth image; the normalization layer limits the feature-point values to a certain range to eliminate the adverse effect of singular feature points; the relu activation layer outputs the feature points as feature maps with enhanced expression; the pooling layer compresses the feature maps to reduce their size; the fully-connected layer connects the compressed feature maps to obtain a local feature map; and the softmax layer classifies the local feature map by outputting hand-posture probabilities to obtain the current local hand posture.
The posture optimization module is used for matching and fusing the hand global posture and the current local hand posture to obtain the target initial hand posture, i.e. generating an initial hand model, and then optimizing the target initial hand posture with a target equation to finally obtain the target hand posture.
S202, train the posture optimization module through the training data set to obtain the target network model.
Specifically, the gesture optimization module is used for matching and fusing the global hand gesture and the current local hand gesture to obtain a target initial hand gesture, and the target initial hand gesture is optimized by using a target equation to finally obtain the target hand gesture.
In this embodiment, the training data set includes a large number of depth images of user hand movements such as opening, closing, pinching, making a fist and scissors. The depth images of the training data set are input into the posture initialization network model, and the target network model is obtained by training the posture optimization module over multiple iterations.
And S203, acquiring a depth image containing the hand region.
And S204, acquiring the target hand gesture through the target network model according to the depth image.
In this embodiment, please refer to the relevant description in steps S101 to S102 in the method shown in fig. 1 for the specific implementation process and technical principle of steps S203 to S204, which is not described herein again.
In this embodiment, a depth image including a hand region is obtained, a target network model is obtained by training a posture initialization network model, and then a target hand posture is obtained through the target network model according to the depth image. The accuracy and the efficiency of the posture tracking are improved, and the real-time posture tracking can be realized without high computing equipment resources (such as a GPU (Graphics Processing Unit)).
Referring to fig. 4 in particular, fig. 4 is a flowchart of a hand gesture tracking method according to a third embodiment of the present invention. As shown in fig. 4, the method for tracking hand gestures in this embodiment may further include, before obtaining the target hand posture through the target network model, step S300 of obtaining a depth image containing a hand region. For the specific implementation process and technical principle of step S300, refer to the related description of step S101 in the method shown in fig. 1, which is not repeated here.
S301, extracting feature points of the hand in the depth image, and outputting the global posture of the hand according to the feature points.
Specifically, a heatmap (thermodynamic diagram) of the hand root-node position is output from the hand root node (including the three-dimensional coordinates of the joint point); a Gaussian probability model is then fitted to obtain the spatial coordinates of the root node (i.e. the 3 spatial-coordinate degrees of freedom of the hand); and a Principal Component Analysis (PCA) method is used to compute the orientation of the feature point cloud, i.e. the three global rotational orientation degrees of freedom of the hand. The root node is the coordinate at the wrist of the hand.
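The two estimation steps above, fitting the root-node heatmap and computing the point-cloud orientation with PCA, might be sketched as follows; the heatmap values and the synthetic point cloud are illustrative assumptions, not patent data:

```python
import numpy as np

def heatmap_expectation(heatmap, coords):
    """Fit the root-node position as the probability-weighted mean of a
    Gaussian-like heatmap blob (a soft-argmax over candidate coordinates)."""
    w = heatmap.ravel() / heatmap.sum()
    return w @ coords.reshape(-1, coords.shape[-1])

def pca_orientation(points):
    """Principal axes of the hand point cloud: eigenvectors of its covariance
    matrix give the global rotation directions of the hand."""
    centered = points - points.mean(axis=0)
    cov = centered.T @ centered / len(points)
    eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalue order
    return eigvecs[:, ::-1]                  # columns by decreasing variance

# Assumed 3x3 heatmap with its peak at pixel (1, 1)
hm = np.array([[0, 1, 0], [1, 8, 1], [0, 1, 0]], float)
ys, xs = np.mgrid[0:3, 0:3]
root = heatmap_expectation(hm, np.stack([xs, ys], axis=-1))  # ~ (1, 1)

# Elongated synthetic "hand" cloud: the main axis should come out along x
pts = np.random.default_rng(0).normal(size=(500, 3)) * np.array([10.0, 2.0, 1.0])
main_axis = pca_orientation(pts)[:, 0]
```

The soft-argmax keeps the position estimate sub-pixel accurate, which is why a Gaussian fit is preferred over taking the raw heatmap maximum.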
S302, extracting the feature points of the depth image, and performing matching classification according to the feature points of the preset hand reference posture and the feature points of the depth image to obtain the current local posture of the hand. Specifically, feature points extracted from the depth image are matched with feature points of a preset hand reference posture, and the current local hand posture is obtained through classification. The hand reference gestures can include hand opening, hand closing, hand pinching and other action gestures, but do not include a hand global gesture.
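The matching classification against preset reference postures can be illustrated with a nearest-neighbor sketch; the reference feature vectors below are invented placeholders, since the patent's learned features are not specified:

```python
import numpy as np

# Hypothetical reference-posture feature vectors (open, fist, pinch);
# in practice these would be learned features, not hand-written ones.
REFERENCE_POSES = {
    "open":  np.array([1.0, 1.0, 1.0, 1.0, 1.0]),
    "fist":  np.array([0.0, 0.0, 0.0, 0.0, 0.0]),
    "pinch": np.array([0.0, 1.0, 1.0, 1.0, 1.0]),
}

def classify_pose(features):
    """Match extracted features against each reference posture and return the
    closest one (Euclidean distance) as the current local posture class."""
    return min(REFERENCE_POSES,
               key=lambda k: np.linalg.norm(features - REFERENCE_POSES[k]))

label = classify_pose(np.array([0.1, 0.9, 1.0, 0.8, 1.1]))  # "pinch"
```

The softmax layer in the classification branch plays the same role probabilistically, scoring each reference posture instead of taking a hard nearest match.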
The sequence of steps S301 to S302 is not limited.
S303, matching and fusing the global hand posture and the current local hand posture to obtain the target initial hand posture, and optimizing the target initial hand posture by using a target equation to obtain the target hand posture.
Specifically, after the target initial hand gesture is subjected to hand gesture optimization processing in step S303, hand shape and hand gesture shape joint optimization is performed; and obtaining the optimized hand gesture, the target hand shape and the target hand gesture.
The gesture optimization module comprises a target equation of a plurality of optimization function items, and the optimization function items of the gesture optimization module are constructed through preset constraint conditions, wherein the optimization function items comprise various optimization function items corresponding to hand gesture optimization and optimization function items corresponding to hand shape optimization; and iteratively training the attitude optimization module to obtain a target network model.
In this embodiment, the posture optimization module fuses the hand global posture and the current local hand posture to estimate the target initial hand posture, combines it with a variable-size hand model (i.e. generates the initial hand model), and then sequentially performs hand posture optimization, hand shape optimization and joint posture-shape optimization to obtain the optimized hand posture, the target hand shape and the target hand posture, i.e. the target hand model.
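The fusion of the global and local poses into one initial pose vector might look like the following sketch; the 6 global degrees of freedom and the 20 joint angles are assumed dimensions for illustration:

```python
import numpy as np

def fuse_poses(global_pose, local_angles):
    """Fuse the hand global posture (3 translation + 3 rotation DoF) with the
    per-joint rotation angles into the target initial hand pose vector."""
    global_pose = np.asarray(global_pose, float)
    assert global_pose.shape == (6,), "expect 3 position + 3 orientation DoF"
    return np.concatenate([global_pose, np.asarray(local_angles, float)])

# 6 assumed global DoF plus 20 assumed joint angles: a 26-DoF initial pose
theta0 = fuse_poses([0.1, -0.05, 0.45, 0.0, 0.3, 0.0], np.zeros(20))
```

This fused vector is what the subsequent posture, shape and joint optimizations refine.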
Referring to fig. 5 in particular, fig. 5 is a schematic diagram illustrating part of the effects of a hand gesture tracking method according to the third embodiment of the present invention. As shown in fig. 5, a black portion (without black dots) indicates that the depth image lies in front of the initial hand model at that feature point (i.e. the initial hand model does not cover the depth image); a black portion with black dots indicates that the initial hand model lies in front of the depth image at that feature point; and a white portion indicates that the initial hand model matches the depth image at that feature point, with an error within ±5 mm. A better optimization result is therefore one in which the initial hand model matches the depth image (i.e. more white is better).
Specifically, from left to right, the first image shows the target initial hand posture output by the target network model, i.e. a roughly estimated hand posture used to construct the generated initial hand model. The second image shows the result of posture optimization only: after optimization the thumb of the initial hand model already matches the depth image, giving the optimized hand posture, but the other fingers do not match because of length differences. The third image shows hand-shape optimization only: the finger lengths of the initial hand model are optimized to match the depth image, giving the target hand shape. The last image shows the target hand posture obtained after joint posture-shape optimization, i.e. the generated target hand model.
Meanwhile, several hand priors are introduced into the optimization process, such as hand collision, joint rotation limits, and temporal information. During the sequential hand posture optimization and the joint hand shape/posture optimization, the optimization function terms of each stage are assigned weights, taking into account the various states a hand may be in (for example, severe self-occlusion, tracking loss, or strong motion blur). The optimization function terms of the posture optimization module are constructed from preset prior constraint conditions, and the posture optimization model is trained iteratively to obtain the target network model.
The hand model is denoted H(θ_t, β) over a series of consecutive depth images I_t, where t is the corresponding time index, θ_t is the hand posture parameter at time t, and β is the hand shape parameter. The goal of the hand posture estimation algorithm is to find a θ_t satisfying the following target equation (formula one):

[formula one: the original equation images are not reproduced; x denotes a point of the hand region in the depth image]

Using the pose initialization network model, an initial solution θ̂_t of the equation is obtained. Substituting this initial solution, the equation can be written as (formula two):

[formula two: the original equation image is not reproduced]
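The refinement step described here — improving the network's initial solution against the depth data — can be sketched as a plain iterative minimization. This is only an illustration: the patent does not name a solver, and the function `refine_pose`, the learning rate, and the toy quadratic energy below are assumptions for demonstration.

```python
import numpy as np

def refine_pose(theta0, grad_energy, lr=0.1, iters=200):
    """Iteratively refine an initial hand pose (a sketch).

    theta0      : initial pose from the pose-initialization network.
    grad_energy : callable returning the gradient of the total energy
                  (data term plus weighted prior terms) at theta.
    Plain gradient descent stands in for the unspecified
    "optimization method" of the text.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iters):
        theta = theta - lr * grad_energy(theta)
    return theta

# Toy quadratic energy E(theta) = ||theta - target||^2, with gradient
# 2 * (theta - target); the refined pose converges toward `target`.
target = np.array([0.3, 0.7])
theta = refine_pose([0.0, 0.0], lambda t: 2.0 * (t - target))
```

In practice the gradient would come from the depth-alignment data term plus the weighted prior terms described below.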
According to formula (II), the current hand posture can be solved with an optimization method. However, because of self-occlusion caused by the hand's high degree of freedom and noise in the depth image, the solution obtained may not match the real situation. Various hand priors are therefore used to constrain the posture, mainly the hand joint rotation limit, the hand collision limit, and the hand timing constraint limit, with the following formulas:
Hand joint rotation limitation E_bound:

[formula image not reproduced]

where θ_i^min and θ_i^max denote the rotation-angle range of the i-th joint, ω_1 is the weight coefficient for the minimum hand joint rotation angle, ω_2 is the weight coefficient for the maximum hand joint rotation angle, and λ_3 is the weight coefficient of this objective function term.
Hand collision limitation E_col:

[formula image not reproduced]

where λ_4 is the weight coefficient of this objective function term, δ denotes the corresponding integral (accumulation) symbol in the formula, d denotes the distance between fingertip points x_i and x_j, χ(i, j) indicates whether the i-th joint collides with the j-th joint, and J_skel(x_i) denotes the skeleton Jacobian at x_i. In an alternative embodiment, the first three rows of the Jacobian matrix, called the position Jacobian, represent the global posture of the hand, and the last three rows, called the orientation matrix, represent the current local posture of the hand; these can be computed using the relevant content of inverse dynamics.
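The collision indicator χ(i, j) and distance d can be illustrated with a sphere approximation of the hand segments. This is a hedged sketch: the sphere model, the quadratic overlap penalty, and the name `e_col` are assumptions, not the patent's exact collision formulation.

```python
import numpy as np

def e_col(centers, radii, lam4=1.0):
    """Hand self-collision penalty E_col (illustrative sphere model).

    Each hand segment is approximated by a sphere; a pair (i, j)
    "collides" (chi(i, j) = 1) when the center distance d is smaller
    than the sum of the radii, and the squared overlap is penalized.
    lam4 mirrors lambda_4 in the text.
    """
    centers = np.asarray(centers, dtype=float)
    cost = 0.0
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            d = np.linalg.norm(centers[i] - centers[j])
            overlap = radii[i] + radii[j] - d
            if overlap > 0:  # chi(i, j): the pair interpenetrates
                cost += overlap ** 2
    return lam4 * cost

# Two unit spheres whose centers are 1 apart overlap by 1.
colliding = e_col([[0.0, 0, 0], [1.0, 0, 0]], [1.0, 1.0])
```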
Timing constraint limitation of the hand, E_temp:

[formula image not reproduced]

where k_i denotes the current coordinates of the i-th joint and k̂_i denotes its estimated position in the previous frame.
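The temporal term can be illustrated as a squared distance between the current joints and their previous-frame estimates. A minimal sketch, assuming a quadratic form and an illustrative weight `lam` (the formula image itself is not reproduced):

```python
import numpy as np

def e_temp(joints, joints_prev, lam=1.0):
    """Temporal constraint E_temp (illustrative form).

    Penalizes the squared distance between each joint's current
    coordinates k_i and its estimated position in the previous frame,
    discouraging frame-to-frame jitter.
    """
    joints = np.asarray(joints, dtype=float)
    joints_prev = np.asarray(joints_prev, dtype=float)
    return lam * np.sum((joints - joints_prev) ** 2)

# A single joint that moved 1 unit along x since the previous frame.
jitter = e_temp([[1.0, 0.0, 0.0]], [[0.0, 0.0, 0.0]])
```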
Referring to table 1 below: in the hand posture optimization process and in the joint hand shape/posture optimization process, the weight of each optimization function term is set differently at different stages.
TABLE 1
[Table 1 image not reproduced: per-stage weights of the optimization function terms]
In this embodiment, the hand posture (mainly the rotation angles of the hand joints) and the hand shape (mainly the length of each hand bone) have different dimensions, so optimizing them directly and simultaneously easily falls into local minimum solutions. By setting the weight of each optimization function term per stage through the target network model, hand posture optimization and hand shape optimization can both be satisfied, for example by estimating which joint bones are occluded and need no optimization. At the same time, accurate and real-time posture tracking is achieved.
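The staged weighting described above can be sketched as a lookup of per-stage weights applied to the evaluated energy terms. The stage names and numeric values below are hypothetical placeholders; the actual values are given in Table 1 of the patent, whose image is not reproduced here.

```python
# Hypothetical per-stage weights for the optimization function terms.
# Stage names and numbers are illustrative only; Table 1 in the patent
# provides the real per-stage weights.
STAGE_WEIGHTS = {
    "pose_only":  {"data": 1.0, "bound": 1.0, "col": 0.5, "temp": 0.2, "shape": 0.0},
    "shape_only": {"data": 1.0, "bound": 0.0, "col": 0.0, "temp": 0.0, "shape": 1.0},
    "joint":      {"data": 1.0, "bound": 1.0, "col": 0.5, "temp": 0.2, "shape": 1.0},
}

def total_energy(terms, stage):
    """Weighted sum of already-evaluated energy terms for one stage."""
    weights = STAGE_WEIGHTS[stage]
    return sum(weights[name] * value for name, value in terms.items())

# In the shape-only stage, pose priors are switched off by zero weights.
e = total_energy({"data": 2.0, "shape": 3.0}, "shape_only")
```

Zeroing a term's weight per stage is one simple way to realize "this term needs no optimization in this stage", e.g. for occluded joints.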
Fig. 6 is a schematic structural diagram of a hand gesture tracking system according to a fourth embodiment of the present invention, and as shown in fig. 6, the hand gesture tracking system 40 according to the present embodiment may include: a processor 41 and a memory 42.
A memory 42 for storing computer programs (such as application programs and functional modules implementing the hand gesture tracking method described above), computer instructions, and the like;
the above computer programs, computer instructions, data, etc. may be stored in one or more memories 42 in partitions, and may be invoked by the processor 41.
A processor 41 for executing the computer program stored in the memory 42 to implement the steps of the method according to the above embodiments.
Reference may be made in particular to the description relating to the preceding method embodiment.
The processor 41 and the memory 42 may be separate structures or may be integrated structures integrated together. When the processor 41 and the memory 42 are separate structures, the memory 42 and the processor 41 may be coupled by a bus 43.
The system in this embodiment may execute the technical solutions of the methods shown in fig. 1, fig. 2, and fig. 4; for the specific implementation process and technical principle, refer to the related descriptions of the methods shown in fig. 1, fig. 2, and fig. 4, which are not repeated here.
The invention provides a method, a device, and a system for tracking hand gestures. The method includes: acquiring a depth image containing a hand region; and obtaining a target hand gesture through a target network model according to the depth image. The target network model is a pre-trained learning model that performs feature extraction and feature recognition on the depth image to obtain a target initial hand gesture, and optimizes the target initial hand gesture to output the target hand gesture. This improves the accuracy and efficiency of posture tracking, and real-time tracking can be achieved without high-end computing resources (such as a GPU, Graphics Processing Unit). The target network model can perform hand posture optimization, hand shape optimization, joint hand posture-shape optimization, and similar processing, meeting the accuracy and real-time requirements of posture tracking.
In addition, embodiments of the present application further provide a computer-readable storage medium, in which computer-executable instructions are stored, and when at least one processor of the user equipment executes the computer-executable instructions, the user equipment performs the above-mentioned various possible methods.
Computer-readable media include both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. Additionally, the ASIC may reside in user equipment. Of course, the processor and the storage medium may also reside as discrete components in a communication device.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (6)

1. A method of hand pose tracking, comprising:
acquiring a depth image containing a hand region;
acquiring a target hand gesture through a target network model according to the depth image; the target network model is a learning model which is trained in advance and used for performing feature extraction and feature recognition on the depth image to obtain a target initial hand posture, and optimizing the target initial hand posture to output the target hand posture;
before obtaining the target hand gesture through the target network model according to the depth image, the method further comprises the following steps:
constructing a posture initialization network model, wherein the posture initialization network model comprises a hand global positioning branch and a hand posture classification branch; the hand global positioning branch is used for extracting feature points of a hand in the depth image and outputting a hand global posture according to the feature points; the hand gesture classification branch is used for extracting feature points of the depth image, and performing matching classification according to the feature points of a preset hand reference gesture and the feature points of the depth image to obtain the current local gesture of the hand; training a posture optimization module through a training data set to obtain a target network model;
the gesture optimization module is specifically configured to match and fuse the global hand gesture and the current local hand gesture to obtain the target initial hand gesture, and optimize the target initial hand gesture by using a target equation to obtain the target hand gesture.
2. The method of claim 1, wherein obtaining a depth image containing a hand region comprises:
a depth camera is used to capture a depth image containing a hand region.
3. The method according to claim 1, wherein the pose optimization module comprises an objective equation of a plurality of optimization function terms, and the optimization function terms of the pose optimization module are constructed through preset constraint conditions, wherein the optimization function terms comprise a plurality of optimization function terms corresponding to hand pose optimization and optimization function terms corresponding to hand shape optimization.
4. A hand gesture tracking device applying the method according to any one of claims 1-3, wherein the target network model comprises a hand global positioning branch and a hand posture classification branch built from a plurality of convolutional layers, a normalization layer, a relu activation layer, a pooling layer, an un-pooling layer, a heatmap layer, a fully-connected layer, a relu activation layer, and a softmax layer;
the hand global positioning branch comprises: a plurality of the convolutional layers, the normalization layer, the relu activation layer, the pooling layer, and the un-pooling layer; wherein the convolutional layers are used for extracting feature points of the hand in the depth image; the normalization layer is used for setting a numerical range for the feature points; the relu activation layer is used for outputting the feature points as a feature map with enhanced expression; the pooling layer compresses the enhanced feature map to obtain a reduced, compressed feature map, and the un-pooling layer is used for upscaling the compressed feature map in connection with the output of the maximum-likelihood distance density map of the joint points; the heatmap layer is used for generating an initial global position of the hand;
the hand posture classification branch comprises: a plurality of sequentially combined groups of the convolutional layer, the normalization layer, and the relu activation layer, followed by the pooling layer, the fully-connected layer, the relu activation layer, and a softmax layer; wherein the convolutional layers are used for extracting feature points of the depth image; the normalization layer is used for defining a numerical range for the feature points; the relu activation layer is used for outputting the feature points as a feature map with enhanced expression; the pooling layer compresses the enhanced feature map to obtain a reduced, compressed feature map; the fully-connected layer connects the compressed feature maps to obtain a local feature map; and the softmax layer classifies the local feature map, outputting hand posture probabilities to obtain the current local hand posture.
5. A system for hand pose tracking, comprising: the device comprises a memory and a processor, wherein the memory stores executable instructions of the processor; wherein the processor is configured to perform the method of hand gesture tracking of any of claims 1-3 via execution of the executable instructions.
6. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of hand gesture tracking according to any one of claims 1 to 3.
CN201910599290.9A 2019-07-04 2019-07-04 Hand gesture tracking method, device and system Active CN110348359B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910599290.9A CN110348359B (en) 2019-07-04 2019-07-04 Hand gesture tracking method, device and system

Publications (2)

Publication Number Publication Date
CN110348359A CN110348359A (en) 2019-10-18
CN110348359B true CN110348359B (en) 2022-01-04

Family

ID=68178370

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910599290.9A Active CN110348359B (en) 2019-07-04 2019-07-04 Hand gesture tracking method, device and system

Country Status (1)

Country Link
CN (1) CN110348359B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368733B (en) * 2020-03-04 2022-12-06 电子科技大学 Three-dimensional hand posture estimation method based on label distribution learning, storage medium and terminal
CN115082266B (en) * 2022-06-27 2023-05-05 山东普照教育科技有限公司 Comprehensive development analysis and evaluation system for student education subjects
CN116189308B (en) * 2023-03-09 2023-08-01 杰能科世智能安全科技(杭州)有限公司 Unmanned aerial vehicle flight hand detection method, unmanned aerial vehicle flight hand detection system and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105389539A (en) * 2015-10-15 2016-03-09 电子科技大学 Three-dimensional gesture estimation method and three-dimensional gesture estimation system based on depth data
CN107066935A (en) * 2017-01-25 2017-08-18 网易(杭州)网络有限公司 Hand gestures method of estimation and device based on deep learning
CN109214282A (en) * 2018-08-01 2019-01-15 中南民族大学 A kind of three-dimension gesture critical point detection method and system neural network based
CN109948498A (en) * 2019-03-13 2019-06-28 中南大学 A kind of dynamic gesture identification method based on 3D convolutional neural networks algorithm

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11170301B2 (en) * 2017-11-16 2021-11-09 Mitsubishi Electric Research Laboratories, Inc. Machine learning via double layer optimization

Also Published As

Publication number Publication date
CN110348359A (en) 2019-10-18

Similar Documents

Publication Publication Date Title
WO2021103648A1 (en) Hand key point detection method, gesture recognition method, and related devices
CN111328396B (en) Pose estimation and model retrieval for objects in images
CN108986801B (en) Man-machine interaction method and device and man-machine interaction terminal
US11514642B2 (en) Method and apparatus for generating two-dimensional image data describing a three-dimensional image
CN104937635B (en) More hypothesis target tracking devices based on model
WO2020208359A1 (en) Using Iterative 3D Model Fitting for Domain Adaption of a Hand Pose Estimation Neural Network
JP6011102B2 (en) Object posture estimation method
CN110348359B (en) Hand gesture tracking method, device and system
EP3644277A1 (en) Image processing system, image processing method, and program
CN108229416B (en) Robot SLAM method based on semantic segmentation technology
US20040175039A1 (en) Viewpoint-invariant image matching and generation of three-dimensional models from two-dimensional imagery
EP3326156B1 (en) Consistent tessellation via topology-aware surface tracking
Yang et al. Reactive obstacle avoidance of monocular quadrotors with online adapted depth prediction network
KR20130101942A (en) Method and apparatus for motion recognition
CN111862299A (en) Human body three-dimensional model construction method and device, robot and storage medium
CN113034652A (en) Virtual image driving method, device, equipment and storage medium
CN110838122B (en) Point cloud segmentation method and device and computer storage medium
CN111709268B (en) Human hand posture estimation method and device based on human hand structure guidance in depth image
CN111144349A (en) Indoor visual relocation method and system
WO2021098554A1 (en) Feature extraction method and apparatus, device, and storage medium
CN114387513A (en) Robot grabbing method and device, electronic equipment and storage medium
CN114586078A (en) Hand posture estimation method, device, equipment and computer storage medium
CN114118181B (en) High-dimensional regression point cloud registration method, system, computer equipment and application
Jo et al. Tracking and interaction based on hybrid sensing for virtual environments
KR102333768B1 (en) Hand recognition augmented reality-intraction apparatus and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant