CN113269089B - Real-time gesture recognition method and system based on deep learning - Google Patents

Real-time gesture recognition method and system based on deep learning Download PDF

Info

Publication number
CN113269089B
CN113269089B (application CN202110574202.7A)
Authority
CN
China
Prior art keywords
hand
gesture
network
frame
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110574202.7A
Other languages
Chinese (zh)
Other versions
CN113269089A (en)
Inventor
宋海涛
盛斌
王资凯
王天逸
谭峰
李佳佳
赵亦博
鞠睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Artificial Intelligence Research Institute Co ltd
Original Assignee
Shanghai Artificial Intelligence Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Artificial Intelligence Research Institute Co ltd filed Critical Shanghai Artificial Intelligence Research Institute Co ltd
Priority to CN202110574202.7A priority Critical patent/CN113269089B/en
Publication of CN113269089A publication Critical patent/CN113269089A/en
Application granted granted Critical
Publication of CN113269089B publication Critical patent/CN113269089B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107Static hand or arm
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a real-time gesture recognition method and system based on deep learning. The method comprises the following steps: collecting an image and extracting a hand depth image from it using a target detection network; converting the hand depth image into 3D voxelized data and inputting the voxelized data into a V2V-PoseNet network to obtain hand key point data, where the V2V-PoseNet network is a pruned V2V-PoseNet network; and preprocessing the hand key points and inputting them into a classification network to classify gesture actions and obtain gesture categories. The method combines state-of-the-art deep learning models, avoids introducing hand-crafted features, and offers strong generalization and expressive capability with good extensibility. The existing models used in the system are pruned and optimized according to the task requirements, improving model speed without sacrificing accuracy. Key point detection and action classification results on the MSRA Hand dataset are good.

Description

Real-time gesture recognition method and system based on deep learning
Technical Field
The invention relates to the technical field of deep learning and gesture recognition, in particular to a real-time gesture recognition method and system based on deep learning.
Background
Science and technology are developing rapidly today, and human-computer interaction technology is widely used in people's daily lives, with ever richer applications continually being developed. Human-computer interaction techniques allow people to communicate with machines in a variety of ways and languages, including gesture languages. Gestures are a natural and intuitive mode of interpersonal communication in daily life, and as human-computer interaction gradually becomes human-centered, gesture recognition has become a research hotspot. It offers users a means of interacting naturally with virtual environments and is one of the most popular human-interface technologies. However, vision-based gesture recognition is a very challenging multi-disciplinary research topic because of the diversity, ambiguity, and temporal and spatial variability of gestures themselves, as well as the complexity of the human hand and the ill-posed nature of vision itself. With the continuous development of related technologies such as image processing and pattern recognition, and the wide application of natural human-machine interaction, gesture recognition technology has become a focus of research.
Existing gesture recognition techniques suffer from drawbacks and deficiencies such as low precision, low speed, high power consumption, and opaque algorithms. In addition, some methods have a limited range of application. For example, although the template matching method frequently used in static gesture recognition is fast, it can only process static gestures and cannot recognize continuous gesture actions composed of multiple video frames.
Disclosure of Invention
Accordingly, the invention aims to provide a real-time gesture recognition method and system based on deep learning, which can accurately recognize gesture actions in real time, and has high speed and high precision.
In order to achieve the above purpose, the present invention provides the following technical solutions:
the invention provides a real-time gesture recognition method based on deep learning, which comprises the following steps:
collecting an image and extracting a hand depth image in the image by utilizing a target detection network;
converting the hand depth image into 3D voxelized data, and inputting the 3D voxelized data into a V2V-PoseNet network to obtain hand key point data; the V2V-PoseNet network is a pruned V2V-PoseNet network;
and preprocessing the hand key points, inputting the preprocessed hand key points into a classification network, and classifying gesture actions to obtain gesture categories.
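For illustration only, the three steps above can be organized as a simple pipeline. The sketch below is a minimal, hedged outline: all function names (detect_hand, voxelize, v2v_keypoints, preprocess_keypoints, classify_gesture) are hypothetical placeholders and are not part of the patent text.

```python
# Minimal pipeline sketch of the three method steps above.
# All function names are hypothetical placeholders, not part of the patent text.
import numpy as np

def recognize_gesture(rgb_frames, depth_frames,
                      detect_hand, voxelize, v2v_keypoints,
                      preprocess_keypoints, classify_gesture):
    """Run detection -> keypoint estimation -> classification on a frame sequence."""
    keypoint_sequence = []
    for rgb, depth in zip(rgb_frames, depth_frames):
        hand_depth = detect_hand(rgb, depth)        # step 1: hand depth image via target detection
        voxels = voxelize(hand_depth)               # step 2a: 3D voxelization
        keypoints = v2v_keypoints(voxels)           # step 2b: pruned V2V-PoseNet -> keypoints
        keypoint_sequence.append(keypoints)
    normalized = preprocess_keypoints(np.stack(keypoint_sequence))  # step 3a: preprocessing
    return classify_gesture(normalized)             # step 3b: gesture category
```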
Further, the hand depth image is acquired by:
acquiring a depth image and an RGB image;
inputting the RGB image into a YOLOv3 network to obtain a hand bounding box;
and aligning the depth image with the RGB image, cutting the depth image according to coordinates of the hand bounding box, and separating the hand region and the background region to obtain the hand depth image.
Further, the hand keypoint data is realized by the following steps:
the 3D voxelization is performed as follows: converting the depth image into a 3D volume form, re-projecting points into a 3D space, discretizing the continuous space, and setting voxel values of the discrete space according to the voxel space position and the target object;
and taking the 3D voxelized data as the input of the V2V-PoseNet network, calculating the likelihood that each key point belongs to each voxel, identifying the position corresponding to the highest likelihood of each key point, and converting the position into real world coordinates to become hand key point data.
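A minimal sketch of how the per-voxel likelihoods could be converted into keypoint coordinates is given below; the voxel origin and voxel size parameters are illustrative assumptions, not values from the patent.

```python
import numpy as np

def likelihood_to_coordinates(likelihood, voxel_origin, voxel_size):
    """likelihood: (K, D, H, W) per-keypoint per-voxel likelihood from the V2V network.
    voxel_origin: (3,) world coordinates of voxel (0, 0, 0); voxel_size: edge length in mm.
    Returns (K, 3) real-world keypoint coordinates."""
    K = likelihood.shape[0]
    coords = np.zeros((K, 3), dtype=np.float32)
    for k in range(K):
        # position of the voxel with the highest likelihood for keypoint k
        z, y, x = np.unravel_index(np.argmax(likelihood[k]), likelihood[k].shape)
        coords[k] = voxel_origin + voxel_size * np.array([x, y, z], dtype=np.float32)
    return coords
```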
Further, the preprocessing includes the following steps:
determining an initial position: taking the palm root point of the first frame of hand image as a datum point;
determining the hand size: the average distance from the palm root to the five finger roots of the hand image is adjusted to a preset value, and all coordinates are transformed in equal proportion according to the following formula:

y_{ij} = \frac{x_{ij} - x_{00}}{\frac{1}{5}\sum_{t=1}^{5}\lVert x_{0 f_t} - x_{00} \rVert}

wherein y_{ij} is the coordinate of the j-th node of the i-th frame after adjustment, x_{ij} is the coordinate of the j-th node of the i-th frame before adjustment, x_{00} is the coordinate of the palm root in the 0-th frame, and f_t is the index of the root of the t-th finger.
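The preprocessing above can be sketched as follows. This is a minimal illustration assuming 3D joint coordinates of shape (T, N, 3), with the palm root at index 0 and the finger-root indices supplied by the caller; these layout assumptions are not fixed by the patent.

```python
import numpy as np

def preprocess_keypoints(joints, finger_root_idx, palm_root_idx=0):
    """joints: (T, N, 3) keypoint coordinates for T frames and N joints.
    finger_root_idx: indices f_t of the five finger-root joints.
    Returns coordinates shifted to the frame-0 palm root and scaled so that the
    average palm-root-to-finger-root distance becomes 1 (the preset value)."""
    origin = joints[0, palm_root_idx]                       # reference point x_00
    shifted = joints - origin                               # remove the initial position
    mean_dist = np.linalg.norm(joints[0, finger_root_idx] - origin, axis=-1).mean()
    return shifted / mean_dist                              # equal-proportion scaling
```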
Further, the gesture classification is performed according to the following steps:
predefining gesture actions as static gestures and dynamic gestures according to the hand key point data;
establishing a static gesture classification network and a dynamic gesture classification network;
and selecting a corresponding classification network to carry out gesture classification according to the static gesture and the dynamic gesture.
Further, the static gesture classification network is a fully connected network; the dynamic gesture classification network is a space-time diagram convolution network model; the space-time diagram convolutional network model is classified according to the following steps: establishing a multi-frame hand node space-time diagram and inputting a space-time diagram convolution network model to obtain a full-diagram feature vector; and obtaining a classification result by using a fully connected network.
Further, the multi-frame hand node space-time diagram is built according to the following steps:
acquiring continuous T-frame gesture images, wherein each gesture image has N key points;
the space-time diagram formed by the multi-frame hand joint points is merged and simplified, the node information being merged through a given correspondence, and the merged node value is calculated according to the following formula:

y_{ij} = \sum_{\alpha \in A_j} w_{\alpha}\, x_{i\alpha}

wherein y_{ij} is the feature vector of the j-th node of the i-th frame after merging, x_{i\alpha} is the feature vector of node \alpha of the i-th frame before merging, A_j is the set of indexes of the pre-merge nodes corresponding to the j-th merged node, and w_{\alpha} is the coefficient corresponding to that class.
Wherein the value of each node is calculated according to the following formula:

y_{ij} = \sum_{t=0}^{h} \sum_{(p,q) \in A_{ijt}} w_{jt}\, x_{pq}

wherein y_{ij} is the feature vector of the j-th node of the i-th frame in the next layer, x_{pq} is the feature vector of the q-th node of the p-th frame in the current layer, A_{ijt} is the set of indexes of points at distance t from the j-th node of the i-th frame in the space-time skeleton diagram, w_{jt} is the corresponding coefficient, and h is a pre-specified maximum acting distance.
The invention also provides a real-time gesture recognition system based on deep learning, which comprises a hand deep image extraction unit, a hand key point detection unit and a gesture action classifier;
the hand depth image extraction unit is used for acquiring images and extracting hand depth images in the images by utilizing a target detection network; the hand depth image is acquired by the following steps: acquiring a depth image and an RGB image; inputting the RGB image into a YOLOv3 network to obtain a hand bounding box; aligning the depth image with the RGB image, cutting the depth image according to coordinates of the hand bounding box, and separating a hand area and a background area to obtain a hand depth image;
the hand key point detection unit is used for converting the hand depth image into 3D voxelized data and inputting the 3D voxelized data into a V2V-PoseNet network to obtain hand key point data; the hand key point data is obtained through the following steps: the 3D voxelization is performed as follows: converting the depth image into a 3D volume form, re-projecting points into a 3D space, discretizing the continuous space, and setting voxel values of the discrete space according to the voxel space position and the target object; the 3D voxelized data is used as the input of the V2V-PoseNet network, the likelihood that each key point belongs to each voxel is calculated, the position corresponding to the highest likelihood of each key point is identified, and the position is converted into real world coordinates to become hand key point data;
the gesture action classifier is used for preprocessing the hand key point data and inputting the hand key point data into the classification network to classify gesture actions to obtain gesture categories; the gesture classification is carried out according to the following steps: predefining gesture actions as static gestures and dynamic gestures according to the hand key point data; establishing a static gesture classification network and a dynamic gesture classification network; selecting a corresponding classification network to carry out gesture classification according to the static gesture and the dynamic gesture; the static gesture classification network is a fully connected network; the dynamic gesture classification network is a space-time diagram convolution network model; the space-time diagram convolutional network model is classified according to the following steps: establishing a multi-frame hand node space-time diagram and inputting a space-time diagram convolution network model to obtain a full-diagram feature vector; and obtaining a classification result by using a fully connected network.
Further, the preprocessing includes the following steps:
determining an initial position: taking the palm root point of the first frame of hand image as a datum point;
determining the hand size: the average distance from the palm root to the five finger roots of the hand image is adjusted to a preset value, and all coordinates are transformed in equal proportion according to the following formula:

y_{ij} = \frac{x_{ij} - x_{00}}{\frac{1}{5}\sum_{t=1}^{5}\lVert x_{0 f_t} - x_{00} \rVert}

wherein y_{ij} is the coordinate of the j-th node of the i-th frame after adjustment, x_{ij} is the coordinate of the j-th node of the i-th frame before adjustment, x_{00} is the coordinate of the palm root in the 0-th frame, and f_t is the index of the root of the t-th finger.
Further, the multi-frame hand node space-time diagram is built according to the following steps:
acquiring continuous T-frame gesture images, wherein each gesture image has N key points;
the space-time diagram formed by the multi-frame hand joint points is merged and simplified, the node information being merged through a given correspondence, and the merged node value is calculated according to the following formula:

y_{ij} = \sum_{\alpha \in A_j} w_{\alpha}\, x_{i\alpha}

wherein y_{ij} is the feature vector of the j-th node of the i-th frame after merging, x_{i\alpha} is the feature vector of node \alpha of the i-th frame before merging, A_j is the set of indexes of the pre-merge nodes corresponding to the j-th merged node, and w_{\alpha} is the coefficient corresponding to that class.
Wherein the value of each node is calculated according to the following formula:

y_{ij} = \sum_{t=0}^{h} \sum_{(p,q) \in A_{ijt}} w_{jt}\, x_{pq}

wherein y_{ij} is the feature vector of the j-th node of the i-th frame in the next layer, x_{pq} is the feature vector of the q-th node of the p-th frame in the current layer, A_{ijt} is the set of indexes of points at distance t from the j-th node of the i-th frame in the space-time skeleton diagram, w_{jt} is the corresponding coefficient, and h is a pre-specified maximum acting distance.
The invention has the beneficial effects that:
the invention provides a real-time gesture recognition method and a real-time gesture recognition system based on a deep learning method, which mainly comprise three parts: firstly, extracting a hand bounding box based on RGB images by using a deep learning target detection method for color and depth (RGBD) images acquired by a depth camera, and separating a hand depth image based on the bounding box and the depth image; secondly, converting the depth information into voxel representation, and detecting the positions of key points of the hand by using a three-dimensional convolution network; finally, according to the positions of the key points, classification methods of static gesture actions and dynamic gesture actions using different network models are respectively provided. The method provided by the invention combines the front deep learning model, avoids the introduction of artificial definition characteristics, and has strong generalization capability and expression capability and good expansibility. The existing model used in the system is pruned and optimized according to the task requirement, and the speed of the model is improved on the premise of not affecting the precision. The key point detection and action classification on the data set MSRAhand are good.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
In order to make the objects, technical solutions and advantageous effects of the present invention more clear, the present invention provides the following drawings for description:
FIG. 1 is a schematic diagram showing the system framework, the YOLOv3 structure, the V2V network structure, and the ST-GCN structure.
Fig. 2 is a schematic diagram of a hand detection flow.
FIG. 3 is a schematic diagram of the recognition results of single-frame and multi-frame gesture actions.
Fig. 4 is a space-time diagram of a plurality of frames of hand joint points.
Fig. 5 is a schematic diagram of a joint fusion method.
Fig. 6 is a schematic diagram of an example of object detection extraction hand bounding box.
Fig. 7 is a schematic diagram of a test on subset 0 of the MSRA dataset.
FIG. 8 is a schematic diagram of a multithreaded pipeline acceleration processing architecture.
In the figure, 1 represents hand depth image extraction, 2 represents hand key point detection, and 3 represents classifying gestures based on key points; 4 denotes an input of hand detection, 5 denotes a process of hand detection, 6 denotes an output of hand detection, and 7 denotes a hand bounding box.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and specific examples, which are not intended to limit the invention, so that those skilled in the art may better understand the invention and practice it.
Example 1
As shown in fig. 1, which schematically shows the system framework, the YOLOv3 structure, the V2V network structure, and the ST-GCN structure, the gesture recognition flow in the system provided in this embodiment may be divided into three main stages: extracting the hand depth image, detecting the hand key points, and classifying the gesture based on the key points.
First, a depth image and an RGB image that can be aligned with each other are obtained from a depth camera. The RGB image is input into a YOLOv3 network, a hand bounding box is extracted from the RGB image using a deep learning target detection method, and the hand depth image is separated based on the bounding box and the depth image;
then, converting the depth information into voxel representation, and detecting the positions of key points of the hand by using a three-dimensional convolution network;
and (3) aligning the depth image with the RGB image, reasonably cutting the depth image according to the bounding box coordinates, separating the hand region from the background region by using a threshold method, and taking the hand depth image as the key point detection input of the next stage. In the key point detection stage, the 2D depth map is subjected to reprojection and 3D voxelization, the hand key point positions in the depth image are predicted by utilizing a V2V network, and three-dimensional coordinates which are the predicted hand key points are output.
Finally, in the classification stage, according to the positions of the key points, classification methods of static and dynamic gesture actions by using different network models are respectively provided, the coordinates of the key points predicted by the V2V network are input into a fully-connected network or an ST-GCN classification network, and specific types of gestures are output.
The system used in this embodiment combines state-of-the-art deep learning models, avoids introducing hand-crafted features, and offers strong generalization and expressive capability with good extensibility. The existing models used in the system are pruned and optimized according to the task requirements, improving model speed without sacrificing accuracy. Key point detection and action classification results on the MSRA Hand dataset are good.
With this structure, the three parts are largely independent: each module can be optimized separately, and improvements to the way the models are connected in series can also be considered.
In this embodiment, the hand bounding box is obtained by using the object detection model YOLOv3, the hand key point position is detected by using the V2V network based on 3D convolution, and the concept of space-time diagram convolution network (ST-GCN) is applied to gesture classification. Pruning and scene optimization are carried out on the model according to the problems, and the speed of model identification is improved on the premise of not affecting the precision. The model in the embodiment has the advantages of instantaneity, high precision, good stability and the like.
Target detection models can be divided into two types. One is the two-stage model, which first generates target candidate boxes (target locations) with an algorithm and then classifies and regresses the candidate boxes; the other is the one-stage model, which uses a single convolutional neural network (CNN) to directly predict the classes and locations of different targets. The first type is more accurate but slow; the second type is fast but less accurate. The first type is represented by R-CNN, proposed by Ross Girshick et al.; Girshick later improved it with Fast R-CNN, which is faster than R-CNN. Mask R-CNN, proposed by Kaiming He et al. on the basis of R-CNN, achieves higher accuracy. The second type is mainly represented by SSD, proposed by Wei Liu et al., and YOLO, proposed by Joseph Redmon et al.
Depth-based 3D hand keypoint recognition: hand pose estimation methods can be divided into generative methods, discriminative methods, and hybrid methods. The generative method assumes a predefined hand model and fits it to the input depth image by minimizing a hand-generation cost function. Particle Swarm Optimization (PSO), Iterative Closest Point (ICP), and their combinations are common algorithms for obtaining optimal hand pose results. The discriminative method locates hand joints directly from the input depth map. Random-forest-based methods offer fast and accurate performance; however, because they rely on manual features, they have been replaced by recent CNN-based methods. Tompson et al. first used a CNN to locate hand keypoints by estimating a two-dimensional heat map for each hand joint. Ge et al. extended this approach by estimating two-dimensional heat maps for each view using a multi-view CNN, and also converted the two-dimensional input depth map into a three-dimensional form to directly estimate three-dimensional coordinates with a three-dimensional CNN. Guo et al. proposed a region-ensemble network to accurately estimate the three-dimensional coordinates of hand keypoints, and Chen et al. improved this network by iteratively refining the estimated poses. Hybrid methods combine generative and discriminative approaches. Oberweger et al. trained a discriminative CNN and a generative CNN jointly through a feedback loop. Zhou et al. predefined a hand model and estimated its parameters rather than predicting the three-dimensional coordinates directly. Ye et al. utilized a spatial attention mechanism and hierarchical PSO.
Classification algorithm: classification means training a classifier on a set of samples with known class labels so that it can classify unknown samples. The naive Bayes classification method proposed by Qiang Wang et al. is simple in logic and easy to implement. Kotsiantis et al. used a Support Vector Machine (SVM) for classification, which effectively handles high-dimensional problems. Classification using decision trees is suitable for nonlinear problems. With the development of deep learning, classification using artificial neural networks has gradually become a research hotspot. Human action classification based on keypoint skeleton sequences can be divided into methods based on manual features and deep learning methods. The first type describes the dynamics of joint motion manually. The second type uses deep learning to build an end-to-end action recognition model; to improve accuracy, each part of the human body needs to be modeled. The space-time graph convolution approach first proposed treating the skeleton sequence as a space-time graph and automatically extracting and classifying features with graph convolution, without introducing manually defined traversal rules or body-part definitions, thereby making full use of the characteristics of the skeleton sequence.
Example 2
The gesture recognition system based on deep learning provided in this embodiment includes a hand depth image extraction unit, a hand key point detection unit, and a gesture motion classifier, and specifically includes the following steps:
a hand depth image extraction unit: a depth somatosensory camera such as Microsoft Kinect is adopted for gesture/gesture detection, firstly, a human body/hand is separated from a background through a segmentation strategy, then a specific part is identified through a machine learning model, a skeleton model composed of key points is generated, and then recognition of actions is completed based on the skeleton.
The present embodiment employs hand extraction-key point recognition-action recognition based on RGBD images. However, for the first step of extracting and separating the hand to be detected, the target detection method is used instead of the segmentation strategy in this embodiment.
The specific flow is as follows: an RGB picture is input, the bounding box of the hand is extracted with a machine learning target detection model, the depth map corresponding to the RGB picture is cropped to the bounding box range, and the hand is separated from the background region of the cropped depth map with a threshold method, yielding depth information directly related to the hand.
The reasons for adopting the target detection method include: (1) detecting a bounding box is cheap compared with pixel-level segmentation, and current machine-learning-based target detection methods are efficient and stable; (2) regarding the separation of background and hand in depth, because of the specificity of the gesture recognition task (the hand is usually closest to the camera while background objects are usually far away), a depth-threshold filtering method can be used directly in most practical applications, or the OTSU method can be used to binarize the depth image to form a hand mask, which removes the background and retains the depth information related to the hand. Moreover, in the key point detection method used in this embodiment, the cropped hand depth map and mask can be used directly as inputs.
This embodiment adopts the target detection network YOLOv3, which offers both detection speed and high precision and is one of the most widely used target detection models. YOLOv3 predicts at three scales and performs well at detecting small objects. To further improve detection speed, a channel-pruned version of the network is used; it is pre-trained on the open-source Oxford Hand dataset and can perform single-target detection of hands in RGB pictures. Experiments show that the pre-trained model is accurate for most tasks and can extract the bounding box precisely. When several bounding boxes are detected, the content of the largest bounding box is kept by default. The model is insensitive only to the small fraction of targets where the hand is very close to the camera, i.e., the hand occupies most of the picture; for such pictures, bounding-box cropping can be skipped and the depth map can be depth-filtered directly to remove the irrelevant background before being used as the input of the next stage.
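A hedged sketch of this hand extraction flow is shown below, using OpenCV for the OTSU thresholding. The YOLOv3 detector is passed in as a callable (its output format is an assumption), and the exact pre- and post-processing of the pruned network is omitted.

```python
import cv2
import numpy as np

def extract_hand_depth(rgb, depth, yolo_detect, keep_largest=True):
    """rgb: aligned color image; depth: aligned depth map at the same resolution.
    yolo_detect(rgb) is assumed to return hand bounding boxes as integer (x1, y1, x2, y2)."""
    boxes = yolo_detect(rgb)
    if not boxes:
        return None, None
    if keep_largest:  # when several boxes are detected, keep the largest one by default
        boxes = [max(boxes, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))]
    x1, y1, x2, y2 = boxes[0]
    hand_depth = depth[y1:y2, x1:x2].copy()

    # OTSU binarization of the cropped depth to build a hand mask; the hand is assumed
    # nearest to the camera (invalid zero-depth pixels may need extra filtering).
    depth_8u = cv2.normalize(hand_depth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    _, mask = cv2.threshold(depth_8u, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    hand_depth[mask == 0] = 0   # remove background, keep hand-related depth only
    return hand_depth, mask
```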
As shown in fig. 2, which illustrates the hand detection flow, after the cropped hand depth image is obtained, hand key point detection can be performed by the hand key point detection unit. The reference method is V2V-PoseNet, and the key point detection flow is as follows: first, the 2D depth image is converted into a 3D volumetric form, the points are re-projected into 3D space, and the continuous space is discretized. After the 2D image is voxelized, the V2V network takes the 3D voxelized data as input and estimates the likelihood that each key point belongs to each voxel. Next, the position corresponding to the highest likelihood of each key point is identified and converted into real-world coordinates, which are output as the final result.
The overall architecture of the model V2V network provided in this embodiment is as follows:
generating input data, and generating input data of a V2V network: the V2V network takes the 3D voxelized data as input to the model. A voxel is the smallest unit of division of digital data in three dimensions, conceptually similar to the smallest unit pixel in two dimensions. A pixel may represent its coordinates with a two-dimensional vector and thus a voxel may represent its spatial position with a three-dimensional vector.
The pixels of the obtained two-dimensional hand depth map are re-projected into three-dimensional space according to the camera parameters, and the three-dimensional space is then divided into discrete cells according to a predefined voxel size. The input value of a voxel is set to 1 if its spatial position is occupied by the target object, and to 0 otherwise. After this step, every voxel is assigned 0 or 1 to indicate whether its three-dimensional position is occupied by the target object, and these voxel values are used as the input of the V2V network to predict the coordinates of the hand key points. Note that the input of the V2V network is three-dimensional voxel data, whereas most existing deep learning models for gesture recognition take two-dimensional depth images as input. In fact, directly regressing three-dimensional key point coordinates from a two-dimensional depth map has two distinct disadvantages: first, the two-dimensional depth image suffers from perspective distortion, so feeding it directly into a neural network means the model sees a distorted hand; second, the mapping from the two-dimensional depth image to the three-dimensional key point coordinates is highly nonlinear, which prevents the neural network from predicting the key point coordinates accurately. Neither problem arises when the two-dimensional depth image is first converted into three-dimensional voxel data and then used as the model input: the three-dimensional voxel data, similar to a point cloud, represents the hand without perspective distortion, and the correspondence between three-dimensional voxels and three-dimensional key point coordinates is comparatively simple, since they share the same dimensionality and the degree of nonlinearity is lower than that between the two-dimensional depth image and the three-dimensional key point coordinates, so the model is easier to train.
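A minimal voxelization sketch under assumed pinhole-camera intrinsics (fx, fy, cx, cy) is given below; the grid size and voxel edge length are illustrative assumptions, not values fixed by the patent.

```python
import numpy as np

def depth_to_voxels(hand_depth, fx, fy, cx, cy, grid=88, voxel_size=5.0):
    """Re-project a cropped hand depth map (in mm) into 3D and build a binary occupancy grid."""
    v, u = np.nonzero(hand_depth)                      # pixel coordinates of hand points
    z = hand_depth[v, u].astype(np.float32)
    x = (u - cx) * z / fx                              # pinhole back-projection
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=1)               # (P, 3) point cloud in camera space

    center = points.mean(axis=0)                       # center the grid on the hand
    idx = np.floor((points - center) / voxel_size).astype(int) + grid // 2
    keep = np.all((idx >= 0) & (idx < grid), axis=1)
    occupancy = np.zeros((grid, grid, grid), dtype=np.float32)
    occupancy[tuple(idx[keep].T)] = 1.0                # voxel value 1 where occupied, else 0
    return occupancy
```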
Building the V2V network requires four components: the first is the volumetric basic block, consisting of a three-dimensional convolution, normalization, and an activation function, located in the first and last parts of the network; the second is the volumetric residual block, derived from the two-dimensional residual block; the third is the volumetric downsampling block, which is the same as a volumetric max-pooling layer; the last is the volumetric upsampling block, which consists of a volumetric deconvolution layer, a normalization layer, and an activation function, where adding the normalization layer and activation function to the deconvolution layer helps simplify the learning process.
The V2V network performs voxel-to-voxel predictions, so it treats the Z axis as an additional spatial axis on top of a 3D convolutional network architecture. As shown in fig. 3, the V2V network starts with a 7 x 7 volumetric basic block and a downsampling block. After downsampling the feature map, useful local features are extracted with three consecutive residual blocks. The output of the volumetric residual blocks is passed through the encoder and decoder in sequence. The network feeds the encoder-decoder output into two 1 x 1 volumetric basic blocks and one 1 x 1 volumetric convolution layer, finally obtaining the likelihood that each key point belongs to each voxel, which is the output of the network and the final target of the model.
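The four building blocks described above can be sketched in PyTorch roughly as follows. Channel counts and kernel sizes here are illustrative, and the full encoder-decoder wiring of V2V-PoseNet is omitted; this is a sketch, not the patented implementation.

```python
import torch.nn as nn

class VolumetricBasicBlock(nn.Sequential):
    """3D convolution + normalization + activation (used in the first and last parts of the network)."""
    def __init__(self, cin, cout, k):
        super().__init__(nn.Conv3d(cin, cout, k, padding=k // 2),
                         nn.BatchNorm3d(cout), nn.ReLU(inplace=True))

class VolumetricResidualBlock(nn.Module):
    """Volumetric residual block derived from the 2D residual block."""
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(c, c, 3, padding=1), nn.BatchNorm3d(c), nn.ReLU(inplace=True),
            nn.Conv3d(c, c, 3, padding=1), nn.BatchNorm3d(c))
        self.act = nn.ReLU(inplace=True)
    def forward(self, x):
        return self.act(self.body(x) + x)

class VolumetricDownsampleBlock(nn.MaxPool3d):
    """Same as a volumetric max-pooling layer."""
    def __init__(self):
        super().__init__(kernel_size=2, stride=2)

class VolumetricUpsampleBlock(nn.Sequential):
    """Volumetric deconvolution + normalization + activation."""
    def __init__(self, cin, cout):
        super().__init__(nn.ConvTranspose3d(cin, cout, 2, stride=2),
                         nn.BatchNorm3d(cout), nn.ReLU(inplace=True))
```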
Pruning of the V2V network: in this embodiment, the V2V network is used to predict the key point positions. Although the V2V network is highly accurate and has a small average key point error, it uses 3D convolutions, which makes prediction slow and real-time gesture recognition difficult. Therefore, in this embodiment the V2V network is pruned to a certain extent when predicting the key point positions, simplifying the network model and increasing its computation speed. The V2V-PoseNet network provided in this embodiment is thus a pruned V2V-PoseNet network. The pruning consists in setting the output dimension of the encoder of the V2V-PoseNet network lower than its original value: specifically, in this embodiment the output dimension of the encoder is reduced from the original 128 dimensions to 96 dimensions. Correspondingly, the input dimension of the decoder is reduced from 128 to 96, and the output dimension of the decoder can be chosen according to the actual situation.
Through testing, the modified model improves the operation speed on the premise of ensuring the precision.
Gesture action classifier: after the hand key points are acquired, gestures can be classified into predefined action semantics based on the hand key points. Gestures can be divided into two types. The first is the static gesture, where a single frame already forms a gesture containing semantics; as shown in fig. 3, which illustrates the recognition results of single-frame and multi-frame gesture actions, any single frame taken from gestures (a) and (b) in fig. 3 can represent "1" and "2", respectively. The second is the dynamic gesture, where a sequence of consecutive frames forms a gesture containing semantics while a single frame is meaningless; for example, (c) and (d) in fig. 3 constitute a leftward hand-waving action and a rightward hand-waving action, respectively, but any single frame taken alone carries little meaning. When classifying these two types of gestures, different classification schemes should be adopted in view of their respective characteristics. The left column in fig. 3 shows gestures whose semantics are formed by a single frame; the right column shows gestures whose semantics are formed by multiple frames while a single frame is meaningless.
Preprocessing: in the gesture recognition problem, the initial position and size of the hand should not affect the gesture semantics, but in practice these values may differ greatly, so the hand key point data must be preprocessed.
To solve the initial-position problem, the palm root point of the first frame of the gesture (the only frame if the gesture is static) is used as the reference point, and the input coordinates of all joint points of this frame and the subsequent frames are taken relative to this reference point; to deal with the hand-size problem, the average distance from the palm root to the five finger roots is adjusted to 1, and all coordinates are scaled in equal proportion as shown in formula (1):

y_{ij} = \frac{x_{ij} - x_{00}}{\frac{1}{5}\sum_{t=1}^{5}\lVert x_{0 f_t} - x_{00} \rVert}    (1)

wherein y_{ij} is the coordinate of the j-th node of the i-th frame after adjustment, x_{ij} is the coordinate of the j-th node of the i-th frame before adjustment, x_{00} is the coordinate of the palm root in the 0-th frame, and f_t is the index of the root of the t-th finger.
Classification model: for static gestures, classification is performed using a fully connected network.
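For the static-gesture branch, a fully connected classifier over the flattened keypoints could look like the following sketch; the layer sizes are illustrative assumptions rather than the values used in the embodiment.

```python
import torch.nn as nn

class StaticGestureClassifier(nn.Module):
    """Fully connected network: flattened 21 x 3 keypoints -> gesture class scores."""
    def __init__(self, num_joints=21, num_classes=17, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(num_joints * 3, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, num_classes))
    def forward(self, keypoints):      # keypoints: (batch, 21, 3)
        return self.net(keypoints)
```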
For dynamic gestures, the input scale is large: when the input skeleton contains 256 frames and each frame contains the three-dimensional coordinates of 21 hand joints, the input size is about 16,000 values, and a fully connected network would generate a very large number of parameters, seriously hurting efficiency and making training difficult. A graph convolution method is therefore used for the gesture classification problem.
Consecutive T frames are taken as one gesture, each with N key points, and the multi-frame hand joint space-time graph is built according to the following rules: each joint of each frame is a node, giving T x N nodes in total; the same joint in adjacent frames is connected by an edge; within the same frame, physically adjacent hand joints are connected by edges. The structure of the resulting graph is shown in fig. 4, which is the space-time graph formed by multiple frames of hand joint points.
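The construction rules above can be expressed as a small adjacency-building routine. The hand-skeleton edge list (HAND_EDGES) below is a placeholder; the real list must follow the actual 21-joint numbering used.

```python
import numpy as np

# Placeholder intra-frame skeleton edges (pairs of joint indices); the real list
# must match the 21-joint hand topology actually used.
HAND_EDGES = [(0, 1), (1, 2), (2, 3), (3, 4)]  # e.g. palm root -> thumb chain, etc.

def build_spacetime_adjacency(T, N, hand_edges=HAND_EDGES):
    """Node (i, j) -> flat index i * N + j; returns a (T*N, T*N) adjacency matrix."""
    A = np.zeros((T * N, T * N), dtype=np.float32)
    for i in range(T):
        for a, b in hand_edges:                 # spatial edges inside frame i
            A[i * N + a, i * N + b] = A[i * N + b, i * N + a] = 1.0
        if i + 1 < T:                           # temporal edges: same joint in adjacent frames
            for j in range(N):
                A[i * N + j, (i + 1) * N + j] = A[(i + 1) * N + j, i * N + j] = 1.0
    return A
```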
Conventional graph convolution methods generally do not change the node number or topology of the graph. However, when the number of layers is large, unrelated information may interfere seriously after repeated convolutions: in gesture recognition, for example, the coordinates of the index fingertip should not be directly related to the coordinates of the little-finger root, yet repeated convolutions diffuse information between nodes so that a node's information ends up recorded in unrelated nodes. Merging and simplifying the graph therefore provides the model with a correlation prior drawn from real-world knowledge. The specific method is to construct a graph simpler than the original one, generated according to real-world connection relations but with fewer nodes than the total number of joints; during convolution, the node information is merged through a given correspondence, and the merged node value is given by formula (2):

y_{ij} = \sum_{\alpha \in A_j} w_{\alpha}\, x_{i\alpha}    (2)

wherein y_{ij} is the feature vector of the j-th node of the i-th frame after merging, x_{i\alpha} is the feature vector of node \alpha of the i-th frame before merging, A_j is the set of indexes of the pre-merge nodes corresponding to the j-th merged node (provided by prior knowledge), and w_{\alpha} is the coefficient corresponding to that class. As shown in fig. 5, the merged graph used here aggregates the white, red, green, yellow, blue, and black node information of fig. 5(a) into the white, red, green, yellow, blue, and black nodes of fig. 5(b), respectively. Fig. 5 is a schematic diagram of the joint merging method. On the merged graph, a graph convolution method may be used, and the value of each node is calculated according to formula (3):

y_{ij} = \sum_{t=0}^{h} \sum_{(p,q) \in A_{ijt}} w_{jt}\, x_{pq}    (3)

wherein y_{ij} is the feature vector of the j-th node of the i-th frame in the next layer, x_{pq} is the feature vector of the q-th node of the p-th frame in the current layer, A_{ijt} is the set of indexes of points at distance t from the j-th node of the i-th frame in the space-time skeleton diagram, w_{jt} is the corresponding coefficient, and h is a pre-specified maximum acting distance. After several graph convolution layers, the full-graph feature vector is obtained, and finally a fully connected network is used to obtain the classification result.
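A compact sketch of the two aggregation steps, node merging per formula (2) and distance-based space-time aggregation per formula (3), is given below; the merge map, neighbor sets, and weights are illustrative placeholders supplied by the caller.

```python
import numpy as np

def merge_nodes(x, merge_map, w):
    """x: (T, N, C) features; merge_map[j] = set A_j of original node indexes merged into node j;
    w[alpha] = class coefficient. Implements y_ij = sum_{alpha in A_j} w_alpha * x_{i,alpha}."""
    T, _, C = x.shape
    y = np.zeros((T, len(merge_map), C), dtype=x.dtype)
    for j, A_j in enumerate(merge_map):
        for alpha in A_j:
            y[:, j] += w[alpha] * x[:, alpha]
    return y

def spacetime_conv(x, neighbors, w, h):
    """neighbors[i][j][t] = set A_ijt of (p, q) nodes at distance t from node (i, j);
    w[j][t] = coefficient. Implements y_ij = sum_{t<=h} sum_{(p,q) in A_ijt} w_jt * x_pq."""
    T, N, C = x.shape
    y = np.zeros_like(x)
    for i in range(T):
        for j in range(N):
            for t in range(h + 1):
                for p, q in neighbors[i][j][t]:
                    y[i, j] += w[j][t] * x[p, q]
    return y
```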
In this embodiment, hand detection is implemented with PyTorch, and the following experimental results are all obtained on an NVIDIA 1080 Ti GPU. On the test set of the original dataset, the pruned YOLOv3 reaches a target detection mAP of 0.76. The model can also correctly recognize imperfect RGB pictures taken by the depth camera; because such tests lack ground-truth annotations, only detection examples are shown, and the RGB picture examples in fig. 6 are selected from the NYU Hand dataset. In the tests, the detection time with CUDA acceleration stays around 30 ms. Fig. 6 is a schematic diagram of target detection extracting the hand bounding box.
When testing the accuracy of the key point locations predicted by the V2V network, the MSRA gesture dataset is used in this embodiment. The MSRA gesture dataset contains 9 subsets, each containing 17 gesture categories, with a total of 21 labeled key points on each hand. The performance index used in this embodiment to evaluate the accuracy of the model's predicted key point locations is the average key point error, i.e., the distance in millimetres (mm) between the predicted hand key point locations and the true hand key point locations.
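The average keypoint error used here is simply the mean Euclidean distance (in millimetres) between predicted and ground-truth joints, for example:

```python
import numpy as np

def average_keypoint_error(pred, gt):
    """pred, gt: (num_samples, 21, 3) hand keypoint coordinates in millimetres.
    Returns the mean per-joint Euclidean distance."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```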
The following is a comparison of the test results of the V2V network with other currently better hand keypoint prediction models on the MSRA gesture dataset.
Table 1: average keypoint error contrast of V2V networks with other models on MSRA gesture datasets
From the results, it can be seen that multi view; occlusion occluded view; crosssingets may be collectively referred to as a deep learning image algorithm model; deep++ depth 3D hand gesture recognition model.
From the results it can be seen that the V2V network achieves good results on the MSRA gesture dataset, with an average key point error of only 7.49 mm, far smaller than that of the other hand key point prediction models. In the experiment in this embodiment, even better results were obtained when subset 0 was used as the test set and subsets 1-8 as the training set: the average key point error was only 7.38 mm. Fig. 7 shows, on subset 0 of the dataset, the relationship between the allowable spatial error and the proportion of key points whose predictions fall within that error range. Fig. 7 is a test on subset 0 of the MSRA dataset.
In addition, in this embodiment the V2V network is pruned, improving the computation speed of the model without affecting accuracy. To verify the change in model accuracy and speed before and after pruning, a test is performed on the 3rd subset of the MSRA gesture dataset, examining the average key point error and average time of the model. Experiments show that the average key point error is 10.64 mm before pruning and 11.02 mm after pruning, while the average detection time is 0.485 s before pruning and 0.32 s after pruning. The results show that the pruning operation does not reduce the accuracy of the model much while greatly increasing the detection speed.
Gesture action classification: training and testing are performed on the MSRA dataset, and the accuracy is shown in Table 2. In Table 2, continuous sampling means intercepting 500 consecutive frames from the dataset, while random sampling means randomly selecting T frames from the dataset and connecting them as one gesture.
Table 2: single-frame and multi-frame gesture classification accuracy
The MSRA dataset consists of single-frame gestures, so the single-frame accuracy is high. With continuous sampling, the model can further confirm the gesture category, so the accuracy for continuously sampled multi-frame gestures is also high. Because the frames in the dataset are independent of one another, gestures become chaotic under random sampling and unintended gesture categories may appear (for example, the "1" and "2" gestures moving in the same direction are treated as the same gesture), so the multi-frame gesture classification accuracy drops. Note also that classification accuracy depends strongly on the accuracy of coordinate regression, so the regressed joint coordinates should be tested as the input.
In this embodiment, a real-time gesture recognition system based on deep learning is designed and implemented. By partially improving the conventional process it provides new and effective ideas, for example converting hand extraction from segmentation to target detection, converting the 2D depth map into 3D voxels before processing, and using space-time graph convolution for feature extraction. The problem is divided sensibly so that each sub-problem can be optimized according to the requirements, and the current state-of-the-art deep learning methods for the related sub-problems are combined into the system, giving the model strong generalization and expressive capability and good extensibility. In the concrete implementation, the existing networks used are pruned and their usage optimized according to the task requirements, improving the efficiency of the models and the accuracy on the gesture recognition problem, with progress in both real-time performance and accuracy.
In the system, the input data of each later module is the output of the previous module, so the accuracy of a later module depends on the previous one. To address this problem and enhance the robustness of the system, the improvement considered in this embodiment is to adopt a network structure that reuses inputs across modules; for example, the depth information generated by the first part (hand detection) may be reused as supplementary information in the third part (action classification).
In the experiments, the classifier model is fast while key point detection takes longer, so multithreading is used to increase the processing speed of the system and enhance real-time performance. Moreover, for gestures composed of multiple frames, the classification network must process a whole group of frames together. To address system real-time performance and the continuous processing of multi-frame pictures, the pipeline working mode shown in fig. 8 is designed. For example, assuming the pre-classification processing time is 3 times the acquisition time of each picture, 4 threads perform the preprocessing in turn to complete key point detection, as shown in fig. 8(a), where the leftmost column is busy time, the second column idle time, the third column busy time, and the rightmost column idle time, with a certain idle margin reserved to prevent conflicts; the processed key point skeletons are then stacked in time order in a unified area to generate time-series data, as shown in fig. 8(b), and finally input into the classification model. This method can be used to increase throughput when space is sufficient and the preprocessing speed is relatively low. FIG. 8 is a schematic diagram of the multithreaded pipeline acceleration processing architecture.
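A minimal sketch of this multithreaded pipeline idea (several preprocessing workers feeding a shared, time-ordered keypoint buffer that the classifier then consumes) is given below; the thread count, group size, and callable interfaces are illustrative assumptions.

```python
import queue
import threading

def run_pipeline(frame_source, detect_keypoints, classify, num_workers=4, group_size=16):
    """frame_source yields (index, frame); detect_keypoints(frame) -> keypoint skeleton;
    classify(list_of_skeletons) -> gesture class. Workers run in parallel, and results
    are re-ordered by frame index before classification."""
    frames, results = queue.Queue(), {}
    lock = threading.Lock()

    def worker():
        while True:
            item = frames.get()
            if item is None:
                return
            idx, frame = item
            skel = detect_keypoints(frame)          # slow per-frame keypoint detection
            with lock:
                results[idx] = skel

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()
    count = 0
    for idx, frame in frame_source:
        frames.put((idx, frame))
        count += 1
    for _ in workers:                               # one sentinel per worker
        frames.put(None)
    for w in workers:
        w.join()
    # stack skeletons in time order and classify groups of consecutive frames
    ordered = [results[i] for i in sorted(results)]
    return [classify(ordered[i:i + group_size]) for i in range(0, count, group_size)]
```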
The above-described embodiments are merely preferred embodiments for fully explaining the present invention, and the scope of the present invention is not limited thereto. Equivalent substitutions and modifications will occur to those skilled in the art based on the present invention, and are intended to be within the scope of the present invention. The protection scope of the invention is subject to the claims.

Claims (5)

1. A real-time gesture recognition method based on deep learning, characterized in that the method comprises the following steps:
collecting an image and extracting a hand depth image in the image by utilizing a target detection network;
converting the hand depth image into 3D voxelized data, and inputting the 3D voxelized data into a V2V-PoseNet network to obtain hand key point data; the V2V-PoseNet network is a pruned V2V-PoseNet network;
preprocessing the hand key points, inputting the preprocessed hand key points into a classification network, and classifying gesture actions to obtain gesture categories;
the gesture classification is carried out according to the following steps:
predefining gesture actions as static gestures and dynamic gestures according to the hand key point data;
establishing a static gesture classification network and a dynamic gesture classification network;
selecting a corresponding classification network to carry out gesture classification according to the static gesture and the dynamic gesture;
the static gesture classification network is a fully connected network; the dynamic gesture classification network is a space-time diagram convolution network model; the space-time diagram convolutional network model is classified according to the following steps: establishing a multi-frame hand node space-time diagram and inputting a space-time diagram convolution network model to obtain a full-diagram feature vector; obtaining a classification result by using a fully connected network;
the preprocessing comprises the following steps:
determining an initial position: taking the palm root point of the first frame of hand image as a datum point;
determining the hand size: the average distance from the palm root to the five finger roots of the hand image is adjusted to a preset value, and all coordinates are transformed in equal proportion according to the following formula:

y_{ij} = \frac{x_{ij} - x_{00}}{\frac{1}{5}\sum_{t=1}^{5}\lVert x_{0 f_t} - x_{00} \rVert}

wherein y_{ij} is the coordinate of the j-th node of the i-th frame after adjustment, x_{ij} is the coordinate of the j-th node of the i-th frame before adjustment, x_{00} is the coordinate of the palm root in the 0-th frame, and f_t is the index of the root of the t-th finger;
the multi-frame hand node time-space diagram is established according to the following steps:
acquiring continuous T-frame gesture images, wherein each gesture image has N key points;
the space-time diagram formed by the multi-frame hand joint points is merged and simplified, the node information being merged through a given correspondence, and the merged node value is calculated according to the following formula:

y_{ij} = \sum_{\alpha \in A_j} w_{\alpha}\, x_{i\alpha}

wherein y_{ij} is the feature vector of the j-th node of the i-th frame after merging, x_{i\alpha} is the feature vector of node \alpha of the i-th frame before merging, A_j is the set of indexes of the pre-merge nodes corresponding to the j-th merged node, and w_{\alpha} is the coefficient corresponding to that class;
wherein the value of each node is calculated according to the following formula:

y_{ij} = \sum_{t=0}^{h} \sum_{(p,q) \in A_{ijt}} w_{jt}\, x_{pq}

wherein y_{ij} is the feature vector of the j-th node of the i-th frame in the next layer, x_{pq} is the feature vector of the q-th node of the p-th frame in the current layer, A_{ijt} is the set of indexes of points at distance t from the j-th node of the i-th frame in the space-time skeleton diagram, w_{jt} is the corresponding coefficient, and h is a pre-specified maximum acting distance.
2. The deep learning based real-time gesture recognition method of claim 1, wherein: the hand depth image is acquired by the following steps:
acquiring a depth image and an RGB image;
inputting the RGB image into a YOLOv3 network to obtain a hand bounding box;
and aligning the depth image with the RGB image, cutting the depth image according to coordinates of the hand bounding box, and separating the hand region and the background region to obtain the hand depth image.
3. The deep learning based real-time gesture recognition method of claim 1, wherein: the hand key point data is realized through the following steps:
the 3D voxelization is performed as follows: converting the depth image into a 3D volume form, re-projecting points into a 3D space, discretizing the continuous space, and setting voxel values of the discrete space according to the voxel space position and the target object;
and taking the 3D voxelized data as the input of the V2V-PoseNet network, calculating the likelihood that each key point belongs to each voxel, identifying the position corresponding to the highest likelihood of each key point, and converting the position into real world coordinates to become hand key point data.
4. A real-time gesture recognition system based on deep learning, characterized in that it comprises a hand depth image extraction unit, a hand key point detection unit, and a gesture action classifier;
the hand depth image extraction unit is used for acquiring images and extracting the hand depth image from them by utilizing a target detection network; the hand depth image is acquired through the following steps: acquiring a depth image and an RGB image; inputting the RGB image into a YOLOv3 network to obtain a hand bounding box; aligning the depth image with the RGB image, cropping the depth image according to the coordinates of the hand bounding box, and separating the hand region from the background region to obtain the hand depth image;
the hand key point detection unit is used for converting the hand depth image into 3D voxelized data and inputting the 3D voxelized data into a V2V-PoseNet network to obtain hand key point data; the hand key point data is obtained through the following steps: performing 3D voxelization as follows: converting the depth image into 3D volumetric form by re-projecting its points into 3D space, discretizing the continuous space, and setting the voxel values of the discrete space according to each voxel's position relative to the target object; taking the 3D voxelized data as the input of the V2V-PoseNet network, calculating the likelihood that each key point lies in each voxel, identifying the position with the highest likelihood for each key point, and converting that position into real-world coordinates to obtain the hand key point data;
the gesture action classifier is used for preprocessing the hand key point data and inputting it into the classification network to classify gesture actions and obtain gesture categories; the gesture classification is carried out according to the following steps: predefining gesture actions as static gestures and dynamic gestures according to the hand key point data; establishing a static gesture classification network and a dynamic gesture classification network; selecting the corresponding classification network for gesture classification according to whether the gesture is static or dynamic; the static gesture classification network is a fully connected network; the dynamic gesture classification network is a spatio-temporal graph convolutional network model; the spatio-temporal graph convolutional network model performs classification according to the following steps: establishing a multi-frame hand-node spatio-temporal graph and inputting it into the spatio-temporal graph convolutional network model to obtain a full-graph feature vector; obtaining a classification result from the full-graph feature vector using a fully connected network;
the multi-frame hand-node spatio-temporal graph is established according to the following steps:
acquiring T consecutive frames of gesture images, wherein each gesture image has N key points;
merging and simplifying the spatio-temporal graph formed by the multi-frame hand joint points, wherein node information is merged through a predefined correspondence and the merged node value is calculated according to the following formula:
$y_{ij} = \sum_{\alpha \in A_j} w_{\alpha}\, x_{i\alpha}$
wherein $y_{ij}$ is the feature vector of the $j$-th node of the $i$-th frame after merging, $A_j$ is the set of indices of the pre-merge nodes corresponding to the $j$-th merged node, $x_{i\alpha}$ is the feature vector of pre-merge node $\alpha$ in the $i$-th frame, and $w_{\alpha}$ is the coefficient of the corresponding class;
wherein the value of each node is calculated according to the following formula:
$y_{ij} = \sum_{t=0}^{h} \sum_{\alpha \in A_{ijt}} w_{jt}\, x_{\alpha}$
wherein $y_{ij}$ is the feature vector of the $j$-th node of the $i$-th frame in the next layer, $A_{ijt}$ is the set of indices of nodes at distance $t$ from the $j$-th node of the $i$-th frame in the spatio-temporal skeleton graph, $x_{\alpha}$ is the feature vector of node $\alpha$ in the current layer, $w_{jt}$ is the corresponding coefficient, and $h$ is a pre-specified maximum acting distance.
5. The deep learning based real-time gesture recognition system of claim 4, wherein: the preprocessing comprises the following steps:
determining an initial position: taking the palm-root point of the first frame of the hand image as the reference point;
determining the hand size: adjusting the average distance from the palm root to the five finger roots of the hand image to a preset value, and scaling all coordinates proportionally according to the following formula:
$y_{ij} = \dfrac{d}{\tfrac{1}{5}\sum_{t=1}^{5}\lVert x_{0k_t} - x_{00}\rVert}\,(x_{ij} - x_{00})$
wherein $y_{ij}$ is the coordinate of the $j$-th node of the $i$-th frame after adjustment, $x_{ij}$ is the coordinate of the $j$-th node of the $i$-th frame before adjustment, $x_{00}$ is the coordinate of the palm root of the 0-th frame, $k_t$ is the index of the root of the $t$-th finger, and $d$ is the preset average distance.
CN202110574202.7A 2021-05-25 2021-05-25 Real-time gesture recognition method and system based on deep learning Active CN113269089B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110574202.7A CN113269089B (en) 2021-05-25 2021-05-25 Real-time gesture recognition method and system based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110574202.7A CN113269089B (en) 2021-05-25 2021-05-25 Real-time gesture recognition method and system based on deep learning

Publications (2)

Publication Number Publication Date
CN113269089A CN113269089A (en) 2021-08-17
CN113269089B true CN113269089B (en) 2023-07-18

Family

ID=77232935

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110574202.7A Active CN113269089B (en) 2021-05-25 2021-05-25 Real-time gesture recognition method and system based on deep learning

Country Status (1)

Country Link
CN (1) CN113269089B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114332977A (en) * 2021-10-14 2022-04-12 北京百度网讯科技有限公司 Key point detection method and device, electronic equipment and storage medium
CN114529949A (en) * 2022-03-18 2022-05-24 哈尔滨理工大学 Lightweight gesture recognition method based on deep learning
CN114898457B (en) * 2022-04-11 2024-06-28 厦门瑞为信息技术有限公司 Dynamic gesture recognition method and system based on hand key points and transformers
CN114926905B (en) * 2022-05-31 2023-12-26 江苏濠汉信息技术有限公司 Cable accessory procedure discriminating method and system based on gesture recognition with glove
CN115220636B (en) * 2022-07-14 2024-04-26 维沃移动通信有限公司 Virtual operation method, virtual operation device, electronic equipment and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107688391A (en) * 2017-09-01 2018-02-13 广州大学 A kind of gesture identification method and device based on monocular vision
CN110333783A (en) * 2019-07-10 2019-10-15 中国科学技术大学 A kind of unrelated gesture processing method and system for robust myoelectric control
CN111209861A (en) * 2020-01-06 2020-05-29 浙江工业大学 Dynamic gesture action recognition method based on deep learning
CN112183198A (en) * 2020-08-21 2021-01-05 北京工业大学 Gesture recognition method for fusing body skeleton and head and hand part profiles

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104636725B (en) * 2015-02-04 2017-09-29 华中科技大学 A kind of gesture identification method and system based on depth image
CN104899600B (en) * 2015-05-28 2018-07-17 北京工业大学 A kind of hand-characteristic point detecting method based on depth map
CN106845335B (en) * 2016-11-29 2020-03-17 歌尔科技有限公司 Gesture recognition method and device for virtual reality equipment and virtual reality equipment
CN107423698B (en) * 2017-07-14 2019-11-22 华中科技大学 A kind of gesture estimation method based on convolutional neural networks in parallel
CN110852311A (en) * 2020-01-14 2020-02-28 长沙小钴科技有限公司 Three-dimensional human hand key point positioning method and device
CN112241204B (en) * 2020-12-17 2021-08-27 宁波均联智行科技股份有限公司 Gesture interaction method and system of vehicle-mounted AR-HUD

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107688391A (en) * 2017-09-01 2018-02-13 广州大学 A kind of gesture identification method and device based on monocular vision
CN110333783A (en) * 2019-07-10 2019-10-15 中国科学技术大学 A kind of unrelated gesture processing method and system for robust myoelectric control
CN111209861A (en) * 2020-01-06 2020-05-29 浙江工业大学 Dynamic gesture action recognition method based on deep learning
CN112183198A (en) * 2020-08-21 2021-01-05 北京工业大学 Gesture recognition method for fusing body skeleton and head and hand part profiles

Also Published As

Publication number Publication date
CN113269089A (en) 2021-08-17

Similar Documents

Publication Publication Date Title
CN113269089B (en) Real-time gesture recognition method and system based on deep learning
CN110532920B (en) Face recognition method for small-quantity data set based on FaceNet method
US20230045519A1 (en) Target Detection Method and Apparatus
CN109948475B (en) Human body action recognition method based on skeleton features and deep learning
Agarwal et al. Sign language recognition using Microsoft Kinect
CN112800937B (en) Intelligent face recognition method
CN112818862A (en) Face tampering detection method and system based on multi-source clues and mixed attention
CN112395442B (en) Automatic identification and content filtering method for popular pictures on mobile internet
Rao et al. Sign Language Recognition System Simulated for Video Captured with Smart Phone Front Camera.
CN110674741A (en) Machine vision gesture recognition method based on dual-channel feature fusion
CN108459785A (en) A kind of video multi-scale visualization method and exchange method
CN113963032A (en) Twin network structure target tracking method fusing target re-identification
CN110210426B (en) Method for estimating hand posture from single color image based on attention mechanism
Raut Facial emotion recognition using machine learning
CN112950477A (en) High-resolution saliency target detection method based on dual-path processing
CN108073851A (en) A kind of method, apparatus and electronic equipment for capturing gesture identification
Saenko et al. Practical 3-d object detection using category and instance-level appearance models
Li et al. Egocentric action recognition by automatic relation modeling
Gong et al. Dark-channel based attention and classifier retraining for smoke detection in foggy environments
Srininvas et al. A framework to recognize the sign language system for deaf and dumb using mining techniques
Muthukumar et al. Vision based hand gesture recognition for Indian sign languages using local binary patterns with support vector machine classifier
Cho et al. Learning local attention with guidance map for pose robust facial expression recognition
Zerrouki et al. Deep Learning for Hand Gesture Recognition in Virtual Museum Using Wearable Vision Sensors
CN111191549A (en) Two-stage face anti-counterfeiting detection method
CN113887509B (en) Rapid multi-modal video face recognition method based on image set

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant