CN110555383A - Gesture recognition method based on convolutional neural network and 3D estimation - Google Patents

Gesture recognition method based on convolutional neural network and 3D estimation

Info

Publication number
CN110555383A
CN110555383A
Authority
CN
China
Prior art keywords
network
estimation
hand
gesture recognition
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910703355.XA
Other languages
Chinese (zh)
Inventor
陈分雄
蒋伟
王晓莉
熊鹏涛
韩荣
叶佳慧
王杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Geosciences
Original Assignee
China University of Geosciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Geosciences filed Critical China University of Geosciences
Priority to CN201910703355.XA priority Critical patent/CN110555383A/en
Publication of CN110555383A publication Critical patent/CN110555383A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24: Classification techniques
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107: Static hand or arm
    • G06V40/113: Recognition of static hand signs
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G06V40/28: Recognition of hand or arm movements, e.g. recognition of deaf sign language

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • General Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a gesture recognition method based on a convolutional neural network and 3D estimation, comprising the following steps: processing the image to be recognized with a SegNet-base network model to extract a hand mask feature map from the image; constructing a supervised deep convolutional network, DetectNet, and locating the hand joint points in the hand mask feature map to obtain a 2D-estimated gesture recognition result; and processing the 2D-estimated result with a PoseNormNet model based on canonical-frame and viewpoint estimation to obtain a 3D-estimated gesture recognition result.

Description

Gesture recognition method based on convolutional neural network and 3D estimation
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a gesture recognition method based on a convolutional neural network and 3D estimation.
Background
Gestures are actions performed by the hands, including movements of the palms, fingers or arms. Gesture recognition, the core of novel human-computer interaction, is of central importance to the whole interactive system. Interactive interfaces based on gesture control are intuitive and easy to operate and can greatly improve user experience; they are widely applied in daily life, in fields including virtual reality, smart homes and medicine. Gesture recognition also has very important practical significance in fields such as assisted driving and sign language translation. Recognizing both simple static hand movements and complex dynamic hand movements is an important interface for human-computer interaction. A static gesture is a single posture state of the hand at one moment, and the correlation between different static gestures is small. A dynamic gesture is a sequence of hand movements over a short, continuous time, i.e., a set of static gestures spread out in time, between which there is strong correlation. A dynamic gesture recognition system increases the computational requirements of the device and the cost of the product, while a stable and efficient static gesture recognition system can accomplish a considerable amount with fewer system resources.
The hand has many joints and high degrees of freedom, which causes severe self-occlusion of hand postures and very high similarity between some actions; an appropriate representation is therefore needed to model the hand structure. For a long time, vision-based gesture recognition research was confined to traditional methods, which often rely on manually selected features, such as skin color and hand texture, to detect gestures.
After depth cameras such as the Kinect appeared, the depth information captured by the sensor brought a new solution to problems that traditional two-dimensional-image-based methods found hard to overcome. The widespread use of depth cameras has also made gesture estimation an increasingly important research focus in computer vision. Gesture estimation extracts the hand pose from image data, depth data and so on in the form of joint points, i.e., it characterizes the whole hand with a small number of key points. This not only reduces the computational burden of the algorithm but also effectively improves the recognizability of the gesture and reduces the influence of regions outside the target.
Traditional gesture recognition algorithms fall mainly into three categories: template-based algorithms, feature-classification-based algorithms and probability-statistics-based algorithms.
Template-based algorithms mainly comprise Template Matching, where, for example, Mahbub recognizes by matching the gesture to be recognized one by one against templates in a pre-trained gesture template library; Dynamic Programming (DP), where, for example, Tew converts a multi-stage decision process into multiple single-stage problems solved one by one, obtaining the global decision result by recursive computation; and Dynamic Time Warping (DTW), where, for example, Hartmann adjusts gesture sequences along the time axis to eliminate speed differences when different people make the same gesture.
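To make the DTW idea concrete, the following minimal Python sketch (not code from the patent; the per-frame feature vectors and the Euclidean frame distance are assumptions) aligns two gesture sequences on the time axis so that speed differences between performers do not dominate the distance:

```python
import numpy as np

def dtw_distance(seq_a: np.ndarray, seq_b: np.ndarray) -> float:
    """Minimal DTW: aligns two gesture feature sequences on the time axis,
    absorbing speed differences between performers of the same gesture."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])  # frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(cost[n, m])

# The same gesture performed at two speeds still aligns closely:
slow = np.sin(np.linspace(0, 3, 60))[:, None]
fast = np.sin(np.linspace(0, 3, 40))[:, None]
print(dtw_distance(slow, fast))
```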
Probability-statistics-based algorithms mainly include the dynamic Bayesian Network (BN), where, for example, Suk fuses timing information on top of a static Bayesian network for dynamic gesture recognition; and Hidden Markov Models (HMMs), where, for example, Lee exploits time-series analysis to achieve good results in gesture recognition.
Commonly used feature-classification-based methods mainly include the Support Vector Machine (SVM), which, for example, Ghaleb combines with high-dimensional mode features for classification in gesture recognition; and AdaBoost (adaptive boosting), where, for example, Ding trains several weak classifiers on a sample set and combines them with certain weights into a strong classifier for gesture recognition.
The traditional machine learning methods above have developed to a certain extent over a long time, but still have the following disadvantages: (1) most feature extraction schemes of these algorithms or models are designed for specific gestures; feature selection depends on the researchers' own experience and carries large uncertainty, the scale of the model parameters is limited by the mechanism of manually setting parameters, and the application scenarios are limited; (2) the number and diversity of samples are not abundant, so the robustness and adaptability of the algorithms are not strong enough.
Deep learning networks have very strong self-learning ability but need huge data volumes and outstanding computing power, so they did not receive industrial attention early on. As convolutional neural networks achieved breakthroughs in computer vision and GPU technology was introduced to accelerate computation, research on deep learning exploded in various fields. More and more researchers apply deep convolutional network models to gesture recognition systems. For example, Garcia proposes a method for recognizing American Sign Language letters in real time, based on transfer learning with an ASL dataset on a pre-trained GoogLeNet model, which can reliably recognize a-e and roughly recognize a-k. Neverova integrates multi-modal data to construct a multi-scale convolutional neural network that learns feature representations over multiple temporal and spatial dimensions. Molchanov uses a 3D CNN to extract the driver's gestures from depth images while avoiding overfitting through data augmentation during training. Wu changes the network structure, employing a dual-stream convolutional network over time and space for verification. Pigou presents an end-to-end model based on temporal convolutions and a bidirectional recurrent neural network, improving video-based gesture recognition speed. These methods achieve better performance in some respects.
The above methods generally use prior knowledge to extract features from the structural characteristics of the image itself. However, owing to the complexity of the hand joints and the high degrees of freedom of the joint points, it is very difficult to estimate the posture accurately from visual cues alone. Many researchers therefore adopt the pose-model construction methods of human pose estimation and analyze hand posture from the perspective of hand joint points; joint point estimation is thus also commonly called pose estimation, and hand pose estimation is discussed here. Gesture estimation finds the hand joint points and the connections between them in the image of interest. In recognition methods based on gesture estimation, the hand joint points are detected first, and the spatial distribution of these nodes is then processed to characterize the hand gesture. Gesture estimation can also be classed among vision-based recognition algorithms.
Early gesture estimation was based on 2D joint points and targeted specific, well-differentiated databases. 2D joint point estimation works well for simple, clear hand gestures but cannot cope with the finger-confusion problem caused by self-occlusion and the similar color of the fingers. Before 2010, the only input for gesture estimation was a color image data stream, which increased its difficulty. Researchers predicted the three-dimensional information of the joint points from the detected position and contour of the human hand. Ballan et al. use an explicit model under a multi-view camera system to infer gestures by matching against a predefined gesture database. Sharp tracks from an initial pose to achieve pose estimation.
The 3D representation of joint points carries more spatial information than 2D joint points, which can effectively alleviate the above problems. With the advent of low-cost depth cameras, research has focused mainly on RGB-D-based gesture estimation. Tompson uses a CNN conditioned on a multi-resolution image pyramid to detect hand joint points in two dimensions, recovering the 3D pose by solving an inverse kinematics optimization problem. Zhou explores correlating joint point coordinates in the bottleneck of an encoding compression; Oberweger trains a CNN to directly regress the three-dimensional coordinates from a given hand-cropped depth map; Zhou also estimates the angles between the bones of the kinematic chain instead of Cartesian coordinates. Oberweger further utilizes a CNN that can synthesize a depth map from a given pose, which allows the initial pose estimate to be improved continuously by minimizing the distance between the observed and synthesized depth images. Pioneering works consider using RGB images for gesture estimation and propose a multi-view bootstrapping method to train a hand joint detector on single RGB images. Most current 3D gesture estimation methods rely on active sensors such as depth cameras, which are easily interfered with by other active components; moreover, depth cameras are not as common as ordinary color cameras, work reliably only indoors, and are unsuitable for outdoor environments and mobile devices.
By comparison, RGB data is simple to acquire, the equipment is portable, and the application scenarios are wider. Full 3D gesture estimation from RGB data alone, although challenging, has begun to attract attention. Gesture estimation and deep learning are both dominant vision-based approaches, and combining them efficiently for gesture recognition is both challenging and novel.
Disclosure of Invention
Aiming at the technical problems that current 3D gesture estimation methods are imperfect and ill-suited to outdoor environments and mobile devices, the present invention provides a gesture recognition method based on a convolutional neural network and 3D estimation to overcome the above technical defects.
A gesture recognition method based on a convolutional neural network and 3D estimation comprises the following steps:
Step one: processing the image to be recognized with a SegNet-base network model to extract a hand mask feature map from the image;
Step two: constructing a supervised deep convolutional network, DetectNet, and locating the hand joint points in the hand mask feature map to obtain a 2D-estimated gesture recognition result;
Step three: processing the 2D-estimated result with a PoseNormNet model based on canonical-frame and viewpoint estimation to obtain a 3D-estimated gesture recognition result.
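For illustration only, the three steps can be pictured as the following Python (PyTorch) cascade; the model interfaces, the tensor shapes and the crop_to_hand helper are assumptions made for exposition, not the patent's implementation:

```python
import torch
import torch.nn.functional as F

def crop_to_hand(image: torch.Tensor, mask: torch.Tensor, size: int = 256) -> torch.Tensor:
    # Hypothetical helper: crop the bounding box of the predicted hand pixels
    # and resize the crop back to the network input size.
    ys, xs = torch.nonzero(mask, as_tuple=True)
    if ys.numel() == 0:                            # no hand found: keep the full image
        return image
    crop = image[:, :, ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    return F.interpolate(crop, size=(size, size), mode='bilinear', align_corners=False)

def recognize_gesture(image, segnet, detectnet, posenormnet, classifier):
    # image: (1, 3, 256, 256) RGB tensor; the four models are the trained stages.
    mask = segnet(image).argmax(dim=1)[0]          # step 1: per-pixel hand/background
    crop = crop_to_hand(image, mask)               # keep only the most salient hand
    score_maps = detectnet(crop)                   # step 2: (1, 21, 32, 32) joint score maps
    pose3d = posenormnet(score_maps)               # step 3: relative 3D pose, (1, 21, 3)
    return classifier(pose3d).argmax(dim=1)        # gesture class index
```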
Further, in step one, the SegNet-base network model first divides the targets in the image into two categories, hand and background, detected with a convolutional neural network; the network layer consists of several convolution blocks and a fully convolutional layer, each convolution block is built from consecutive convolutions, and the convolutions use 'SAME' padding.
Further, in step two, a convolutional neural network (CNN) model is used to build the hand joint point detection network DetectNet, which is combined with the SegNet-base network into the 2D hand joint point detection model PoseNet2D; the hand mask feature map is processed with PoseNet2D to obtain the 2D-estimated gesture recognition result.
Further, in step three, the input dimensions of the PoseNormNet model are set according to the 2D hand joint point detection model PoseNet2D.
Compared with the prior art, the invention has the following advantages:
On the basis of an existing network model, cross-layer connections and deconvolution are added to obtain an improved network model. The network outputs the maximum hand segmentation mask to locate the most salient hand in the image. For three-dimensional node estimation, a 3D gesture estimation network based on canonical-frame and viewpoint estimation is proposed, through which 3D hand pose distributions are inferred from the 2D key nodes. Finally, the 3D hand pose distribution output by the third network is used as the feature input of a classifier, and a softmax-based fully connected network is constructed for gesture recognition, effectively improving the reliability and accuracy of gesture recognition.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a flow chart of a gesture recognition method based on convolutional neural network and 3D estimation according to the present invention;
FIG. 2 is a schematic diagram of the SegNet-base segmentation network architecture of the present invention;
FIG. 3 is a diagram of the 2D gesture estimation network architecture of the present invention;
FIG. 4 is a hand joint point skeleton model diagram of the present invention;
FIG. 5 is a PoseNormNet model architecture diagram of the present invention;
FIG. 6 is a PCK curve diagram of DetectNet of the present invention under error thresholds of 0-30 pixels;
FIG. 7 is a PCK comparison between the DetectNet model and the PoseNet2D model of the present invention;
FIG. 8 is a graph of the detection effect of PoseNet2D of the present invention on an RHD data set;
FIG. 9 is a diagram of the PCK of the PoseNormNet of the present invention compared to other models in the RHD dataset.
Detailed Description
For a clearer understanding of the technical features, objects and effects of the present invention, embodiments of the present invention are now described in detail with reference to the accompanying drawings.
The SegNet-base hand segmentation network aims to extract the hand and focus subsequent operations on it. It comprises two steps. In the first step, the targets in the image are divided into two categories, hand and background, detected with a convolutional neural network. In the second step, the feature map generated in the first step is upsampled to the original image size, and a softmax classifier is applied and compared against the ground-truth labels to constrain the actual output. As shown in FIG. 2, the adopted hand segmentation network is an end-to-end trainable network model.
The SegNet-base network model turns the hand localization problem into a segmentation problem to extract the hand mask feature map. The network layer consists of several convolution blocks and a fully convolutional layer; each convolution block is built from consecutive convolutions, and the convolutions use 'SAME' padding. The network input is set to 256 × 256 × 3, completing the first step of the segmentation network design. The second step uses ordinary bilinear-interpolation upsampling: SegNet-base directly upsamples the feature map extracted by the last convolution block by a factor of 8, restoring it from 32 × 32 to 256 × 256, and the network outputs the hand mask corresponding to the maximum score.
TABLE 1 Hand segmentation network architecture
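A minimal PyTorch sketch of a network with this shape is given below; the block depths and channel widths are illustrative assumptions (the parameters of Table 1 are not reproduced here), but it shows the 'SAME'-padded convolution blocks, the fully convolutional scoring layer, and the single 8x bilinear upsampling from 32 × 32 back to 256 × 256:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegNetBaseSketch(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        def block(cin, cout, n):
            # a convolution block: n consecutive 3x3 'SAME' convolutions, then pooling
            layers = []
            for i in range(n):
                layers += [nn.Conv2d(cin if i == 0 else cout, cout, 3, padding=1),
                           nn.ReLU(inplace=True)]
            layers.append(nn.MaxPool2d(2))
            return nn.Sequential(*layers)
        # three blocks reduce 256x256 inputs to 32x32 feature maps (factor 8)
        self.features = nn.Sequential(block(3, 64, 2), block(64, 128, 2), block(128, 256, 3))
        self.head = nn.Conv2d(256, num_classes, 1)   # fully convolutional scoring layer

    def forward(self, x):                            # x: (B, 3, 256, 256)
        scores = self.head(self.features(x))         # (B, 2, 32, 32)
        return F.interpolate(scores, scale_factor=8, mode='bilinear',
                             align_corners=False)    # (B, 2, 256, 256)

# The hand mask is the class with the maximum score at each pixel:
# mask = SegNetBaseSketch()(img).argmax(dim=1)
```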
Combining reasonable prior knowledge from the original color image data, the distribution of hand joint points is learned to indirectly describe the target position, and a 2D gesture estimation framework is constructed, as shown in FIG. 3.
The hand posture model adopted is an improvement on the linear blend skinning (LBS) model, a general term for models that express the hand through degrees of freedom (DOF) at the joints and the bones connecting them, characterizing the whole hand posture with a small number of key points. The improvement is to represent the joints by their spatial coordinates instead of degrees of freedom, giving the hand gesture node representation model shown in FIG. 4. It contains 21 joint points: one for the palm (or wrist) and 4 for each finger. Given a one-handed color image with N rows, M columns and 3 channels, a coordinate pair w_i = (x_i, y_i) describes the position of the i-th joint point in two-dimensional space, with i ∈ [1, J] and J = 21 herein. During labeling, numbers denote the different joint points: the palm (or wrist) is number 0; the TIP, DIP, PIP and MCP of the thumb (T) are numbered 1-4 in sequence; the index finger (I) corresponds to numbers 5-8, the middle finger (M) to 9-12, the ring finger (R) to 13-16, and the little finger (P) to 17-20. The right hand is labeled analogously to the left hand.
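The numbering scheme can be written down directly; the following Python constants (an illustrative encoding, not part of the patent) capture the 21 joint indices and the skeleton edges implied by FIG. 4:

```python
# Joint numbering as described above: palm/wrist = 0; per finger the
# TIP, DIP, PIP, MCP points in sequence.
PALM = 0
FINGERS = {
    "thumb":  (1, 2, 3, 4),      # TIP, DIP, PIP, MCP
    "index":  (5, 6, 7, 8),
    "middle": (9, 10, 11, 12),
    "ring":   (13, 14, 15, 16),
    "little": (17, 18, 19, 20),
}

# Skeleton edges: palm -> MCP, then MCP -> PIP -> DIP -> TIP along each finger.
SKELETON = []
for tip, dip, pip_joint, mcp in FINGERS.values():
    SKELETON += [(PALM, mcp), (mcp, pip_joint), (pip_joint, dip), (dip, tip)]

assert len(SKELETON) == 20 and len({PALM, *sum(FINGERS.values(), ())}) == 21
```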
Locating the hand joint points in the two-dimensional image builds on the segmentation network. The main scheme consists of two cascaded deep network models: the first network performs hand detection and segmentation to locate the hand; based on the segmentation map output by this preceding network, the hand structure is cropped from the original image; the second network then locates the hand joint points in the two-dimensional image. Following an encoder-decoder pose framework, two-dimensional joint localization is treated as the estimation of a two-dimensional score map, and a supervised deep convolutional network, denoted DetectNet, is designed. The DetectNet layer design is shown in Table 2, ordered from network input (top) to output (bottom).
TABLE 2 DetectNet network layer design
Table 3 lists each convolution block, where the number of layers is the number of consecutive convolutions in the block, and the number of input channels of each convolution kernel is determined by the input of the current layer. To compensate for the feature loss caused by the large number of network layers and the depth of refinement, cross-layer connections are used in the middle layers. fc1, fc2 and fc3 denote fully convolutional blocks; the features at this stage are fused by 1 × 1 convolutions.
TABLE 3 Parameters of each convolution block

Convolution block | Number of layers | Number of output channels
conv1 | 2 | 64
conv2 | 2 | 128
conv3 | 4 | 256
conv4 | 5 | 256
conv5 | 6 | 128
conv6 | 6 | 128
The 3 conv + pool modules preliminarily extract the distribution characteristics of the hand joints. The remainder of the DetectNet network is divided into three stages; each stage ends in feature fusion, but the stages differ. In the first stage, conv4 and fc1 after pool3 fuse the preceding features, yielding the first rough prediction score map fc1_2. The input to the second stage consists of conv4_5 and fc1_2; the input to the third stage consists of conv4_5 and fc2_2; these two stages yield the prediction score maps fc2_2 and fc3_2, respectively. The concat(D1, D2) operation joins feature layer D1 and feature layer D2 along the corresponding dimension, combining features from different stages. All three prediction score maps have size 32 × 32 × 21, each feature map representing the position distribution of one of the 21 joint points. Each later score map localizes the joint points more accurately than the preceding convolution, so the position information in fc3_2 is the most refined and serves as the basis of the network output.
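A compressed PyTorch sketch of this staged refinement follows; the backbone stands in for the conv + pool modules and the widths are assumptions, but it reproduces the pattern described above: each stage concatenates earlier features with the previous score map and emits a 32 × 32 × 21 prediction, the last playing the role of fc3_2:

```python
import torch
import torch.nn as nn

class DetectNetSketch(nn.Module):
    def __init__(self, j=21, c=128):
        super().__init__()
        self.backbone = nn.Sequential(   # stands in for the conv + pool blocks
            nn.Conv2d(3, c, 3, stride=2, padding=1), nn.ReLU(True),
            nn.Conv2d(c, c, 3, stride=2, padding=1), nn.ReLU(True),
            nn.Conv2d(c, c, 3, stride=2, padding=1), nn.ReLU(True))
        self.stage1 = nn.Conv2d(c, j, 1)        # rough score map, like fc1_2
        self.stage2 = nn.Conv2d(c + j, j, 1)    # refines the concatenated input
        self.stage3 = nn.Conv2d(c + j, j, 1)    # most refined map, like fc3_2

    def forward(self, x):                # x: (B, 3, 256, 256)
        f = self.backbone(x)             # (B, c, 32, 32)
        s1 = self.stage1(f)
        s2 = self.stage2(torch.cat([f, s1], dim=1))  # concat(D1, D2) fusion
        s3 = self.stage3(torch.cat([f, s2], dim=1))
        return s1, s2, s3                # each (B, 21, 32, 32)
```

Intermediate score maps s1 and s2 can be supervised as well, which is the usual motivation for multi-stage designs of this kind.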
The PoseNormNet network structure is shown in FIG. 5. It is a network based on canonical-frame and viewpoint estimation and comprises two structurally similar parts: the upper part learns the 3D coordinates w_c in the canonical frame, and the lower part learns a rotation matrix R between the initial frame and the canonical frame; finally the normalized relative coordinates w_rel are obtained. Its characteristics are: the upper (and lower) part networks expand the number of convolution layers, replace pooling layers with strided convolutions, and add left-hand/right-hand information. To match the output of the 2D estimation from PoseNet2D, the input dimensions are set to 32 × 32 × 21, so the input to the PoseNormNet model is actually the joint score maps rather than two-dimensional coordinates.
The specific scale of the upper network's layers is shown in Table 4; only the per-layer parameters of the lower network differ from the upper network, and the bracketed values in the last column give the channel dimensions of the lower network's features. In the table, s = 2 means the convolution stride of that layer is 2, which corresponds to pooling the features. A ReLU activation function and a dropout operation with rate 0.2 are used in the fully connected layer FC. When constructing the upper network, D = 63, for the three-dimensional coordinates of the 21 nodes; when constructing the lower network, D = 3, for the three axis-angle parameters of the rotation matrix.
TABLE 4 Specific scale of the PoseNormNet upper network
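The two-stream idea can be sketched in PyTorch as follows; the tiny encoder, the axis-angle-to-rotation conversion via Rodrigues' formula, and the composition of w_c with R are illustrative assumptions consistent with the description (D = 63 for the canonical coordinates, D = 3 for the viewpoint; strided convolutions instead of pooling; dropout 0.2 in the FC layers):

```python
import torch
import torch.nn as nn

def axis_angle_to_matrix(v: torch.Tensor) -> torch.Tensor:
    """Rodrigues' formula: (B, 3) axis-angle parameters -> (B, 3, 3) rotations."""
    theta = v.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    k = v / theta                                   # unit rotation axis
    K = torch.zeros(v.shape[0], 3, 3, device=v.device)
    K[:, 0, 1], K[:, 0, 2] = -k[:, 2], k[:, 1]
    K[:, 1, 0], K[:, 1, 2] = k[:, 2], -k[:, 0]
    K[:, 2, 0], K[:, 2, 1] = -k[:, 1], k[:, 0]
    I = torch.eye(3, device=v.device).expand_as(K)
    s, c = torch.sin(theta)[..., None], torch.cos(theta)[..., None]
    return I + s * K + (1 - c) * (K @ K)

class PoseNormNetSketch(nn.Module):
    def __init__(self, j=21):
        super().__init__()
        def stream(d_out):
            return nn.Sequential(
                nn.Conv2d(j, 32, 3, stride=2, padding=1), nn.ReLU(True),  # s=2 replaces pooling
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(True),
                nn.Flatten(),
                nn.Linear(64 * 8 * 8, 256), nn.ReLU(True), nn.Dropout(0.2),
                nn.Linear(256, d_out))
        self.canonical = stream(j * 3)   # upper stream: w_c, D = 63
        self.viewpoint = stream(3)       # lower stream: axis-angle of R, D = 3

    def forward(self, score_maps):       # score_maps: (B, 21, 32, 32)
        b = score_maps.shape[0]
        w_c = self.canonical(score_maps).view(b, -1, 3)
        R = axis_angle_to_matrix(self.viewpoint(score_maps))
        w_rel = w_c @ R.transpose(1, 2)  # rotate the canonical pose into the view
        return w_rel, w_c, R
```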
Example 1: reconstruction result of test sample after CBN-VAE network training
The dataset used in this example is a virtual dataset (RHD) synthesized from three-dimensional animated character models and natural backgrounds. 16 characters are randomly selected, and the samples of 31 actions performed by them form the training dataset, containing 41258 images of size 320 × 320 × 3, denoted RHD_train. The remaining 4 characters and the samples of the other 8 actions performed by them form the test dataset, containing 2728 images of size 320 × 320 × 3, denoted RHD_test.
The SegNet-base and SegNet-prop networks were trained and their performance evaluated on the RHD_test set, with no operation other than resizing applied to the test data; the evaluation results are shown in Table 5.
TABLE 5 Performance comparison on the RHD dataset
The evaluation results show that the SegNet-prop model achieves hand segmentation more effectively than SegNet-base with its simple bilinear-interpolation upsampling.
Example 2: training was performed on the RHD _ train dataset and STB _ train dataset, respectively, and then evaluation was performed on the RHD _ test dataset and STB _ test dataset, respectively, with the results of the evaluation shown in table 6 and fig. 6. The EPE difference between the two is shown in Table 6, the endpoint error in the table is pixel unit, the AUC of the error threshold value range from 0 to 30 is calculated, the arrow points represent the variation of the performance level along with the value size, and the upward represents that the performance is better when the value is larger.
TABLE 6 DetectNet performance evaluation

Test set | EPE mean (px) ↓ | EPE median (px) ↓ | AUC (0~30) ↑
RHD_test | 11.619 | 9.427 | 0.814
STB_test | 8.823 | 8.164 | 0.917
The PCK curves of the DetectNet network under different endpoint error thresholds show that the PCK value grows quickly while the error threshold is below 15 pixels; above 15 pixels, PCK grows slowly. A smaller error threshold places higher demands on the network output, so setting the error threshold between 10 and 15 yields the best-performing model at this stage. The EPE mean and EPE median results likewise prove the feasibility of DetectNet for 2D joint point detection, with good performance.
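For reference, the EPE and PCK quantities used in these evaluations can be computed as in the following sketch (standard definitions, assumed to match the tables; thresholds in pixels over the 0~30 range):

```python
import numpy as np

def epe(pred: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """End-point error in pixels between predicted and ground-truth 2D joints;
    inputs are (N, 21, 2) arrays. Returns (mean, median)."""
    err = np.linalg.norm(pred - gt, axis=-1)
    return float(err.mean()), float(np.median(err))

def pck_curve(pred, gt, thresholds=np.arange(0, 30.5, 0.5)):
    """PCK: fraction of joints whose error falls under each threshold;
    AUC is the normalized area under this curve over 0-30 px."""
    err = np.linalg.norm(pred - gt, axis=-1).ravel()
    pck = np.array([(err <= t).mean() for t in thresholds])
    auc = np.trapz(pck, thresholds) / (thresholds[-1] - thresholds[0])
    return pck, float(auc)
```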
example 3: evaluation of a complete 2D hand joint point detection network
The complete hand joint point detection network, denoted PoseNet2D herein, is formed by cascading the hand segmentation network SegNet-prop and the joint detection network DetectNet. The segmentation network is first trained on RHD_train only, i.e., the conclusion of the previous example is reused. A better DetectNet model is then obtained on the Joint_train training set, and finally the PoseNet2D network is tested on the RHD_test, STB_test and Dexter datasets. As shown in FIG. 7, the solid curves are the PCK of the DetectNet model on the RHD_test and STB_test sets, and the dashed curves are the PCK of the PoseNet2D model on the RHD_test, STB_test and Dexter test sets.
The example results show that the PoseNet2D model adapts well to the RHD and STB datasets and performs well, providing effective input for the subsequent network.
Example 4: representative results of joint detection on RHD data sets using the PoseNet2D model are shown in fig. 8, where the first row shows hand joint bone connection and the bottom row shows hand joint point distribution.
In this example, the first two columns show hand self-occlusion and the last two columns show simple hand actions; the hand joints are detected in all cases. Note that the sample in the second column contains two hands; since the SegNet-prop segmentation network outputs only the hand with the larger prediction score, the input to DetectNet is a cropped map containing only one hand, which is consistent with the desired output.
Example 5: evaluation of PoseNormNet networks
In the embodiment of this section, the initial learning rate is set to be 40000 maximum iterations, and the learning rate decreases by 10 times every 10000 iterations. All weights are initialized randomly. To avoid overfitting of the network, the present embodiment takes two strategies: (1) the randomness of the truth score map is enhanced by using Gaussian noise with the variance of 1.5 pixels to interfere with the key point positions; (2) and carrying out random scaling and overturning on the truth score map to further enhance the randomness. In this embodiment, a PoseNormNet model is learned in the RHD _ train data set, and the RHD _ test set is evaluated and compared with a Direct model, a Bottleneck model and a Local model, and the evaluation results are shown in Table 7 and FIG. 9.
TABLE 7 Performance evaluation of each model
Table 7 shows that both the EPE mean and the EPE median of the improved PoseNormNet model decrease on RHD; the examples show that the improved model, based on relative-normalized canonical-frame processing and viewpoint estimation, achieves better performance than the Direct, Local and Bottleneck models.
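A sketch of the two anti-overfitting strategies from this example follows; the Gaussian heatmap rendering and the scaling range are assumptions, while the noise variance of 1.5 pixels comes from the description above:

```python
import numpy as np

def render_score_maps(keypoints, size=32, sigma=1.0):
    """Render one Gaussian 'truth' score map per joint; returns (21, size, size)."""
    ys, xs = np.mgrid[0:size, 0:size]
    maps = [np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
            for x, y in keypoints]
    return np.stack(maps)

def augment(keypoints, rng, size=32):
    # (1) jitter keypoints with Gaussian noise of variance 1.5 px before rendering
    kp = keypoints + rng.normal(0.0, np.sqrt(1.5), keypoints.shape)
    # (2) random scaling about the map center (range is an assumption) and flipping
    kp = (kp - size / 2) * rng.uniform(0.9, 1.1) + size / 2
    maps = render_score_maps(kp, size)
    if rng.random() < 0.5:                                   # random horizontal flip
        maps = maps[:, :, ::-1]
        kp = np.column_stack([size - 1 - kp[:, 0], kp[:, 1]])
    return maps.copy(), kp

# Usage: maps, kp = augment(np.random.default_rng(0).uniform(4, 28, (21, 2)),
#                           np.random.default_rng(1))
```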
In summary, the gesture recognition method based on the convolutional neural network and the 3D estimation of the present invention includes:
(1) Hand segmentation: cross-layer connections and deconvolution are added to an existing network model to obtain an improved network model. The network outputs the maximum hand segmentation mask to locate the most salient hand in the image.
(2) Hand joint point detection: a network detects the hand joint points in the cropped image and outputs a distribution score map for each node; the score maps record the position distribution of the nodes on the image plane.
(3) Three-dimensional node estimation: a 3D gesture estimation network based on canonical-frame and viewpoint estimation is proposed, through which 3D hand pose distributions are inferred from the 2D joint points.
Finally, the 3D hand pose distribution output by the third network is used as the feature input of a classifier, and a softmax-based fully connected network is constructed for gesture recognition. An average recognition rate of 94% is obtained on a custom gesture database constructed from the STB dataset, and an average recognition rate of 69.3% on the RETH database, nearly 7% higher in accuracy than the original work, proving the feasibility of the RGB-based 3D gesture estimation method. The invention effectively improves the reliability and accuracy of gesture recognition.
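As a hedged sketch of this final stage, a softmax-based fully connected classifier over the 21 × 3 relative coordinates might look like the following; the hidden width and the use of cross-entropy (log-softmax plus negative log-likelihood) are assumptions:

```python
import torch
import torch.nn as nn

class GestureClassifierSketch(nn.Module):
    """Fully connected softmax classifier over the 3D hand pose features
    (21 x 3 relative joint coordinates); layer sizes are illustrative."""
    def __init__(self, num_classes: int, j: int = 21):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(j * 3, 128), nn.ReLU(True),
            nn.Linear(128, num_classes))   # logits; softmax is applied in the loss

    def forward(self, pose3d):             # pose3d: (B, 21, 3)
        return self.net(pose3d)

# Training would typically use nn.CrossEntropyLoss over these logits.
```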
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (4)

1. A gesture recognition method based on a convolutional neural network and 3D estimation, characterized by comprising the following steps:
Step one: processing the image to be recognized with a SegNet-base network model to extract a hand mask feature map from the image to be recognized;
Step two: constructing a supervised deep convolutional network DetectNet, and locating the hand joint points in the hand mask feature map to obtain a 2D-estimated gesture recognition result;
Step three: processing the 2D-estimated gesture recognition result with a PoseNormNet model based on canonical-frame and viewpoint estimation to obtain a 3D-estimated gesture recognition result.
2. The gesture recognition method based on a convolutional neural network and 3D estimation according to claim 1, characterized in that in step one the SegNet-base network model first divides the targets in the image into two categories, hand and background, detected with a convolutional neural network; the network layer consists of several convolution blocks and a fully convolutional layer, each convolution block is built from consecutive convolutions, and the convolutions use 'SAME' padding.
3. The gesture recognition method based on a convolutional neural network and 3D estimation according to claim 1, characterized in that in step two a convolutional neural network (CNN) model is used to build the hand joint point detection network DetectNet, which is combined with the SegNet-base network into the 2D hand joint point detection model PoseNet2D, and the hand mask feature map is processed with PoseNet2D to obtain the 2D-estimated gesture recognition result.
4. The gesture recognition method based on a convolutional neural network and 3D estimation according to claim 1, characterized in that in step three the input dimensions of the PoseNormNet model are set according to the 2D hand joint point detection model PoseNet2D.
CN201910703355.XA 2019-07-31 2019-07-31 Gesture recognition method based on convolutional neural network and 3D estimation Pending CN110555383A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910703355.XA CN110555383A (en) 2019-07-31 2019-07-31 Gesture recognition method based on convolutional neural network and 3D estimation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910703355.XA CN110555383A (en) 2019-07-31 2019-07-31 Gesture recognition method based on convolutional neural network and 3D estimation

Publications (1)

Publication Number Publication Date
CN110555383A 2019-12-10

Family

ID=68737194

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910703355.XA Pending CN110555383A (en) 2019-07-31 2019-07-31 Gesture recognition method based on convolutional neural network and 3D estimation

Country Status (1)

Country Link
CN (1) CN110555383A (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657634A (en) * 2018-12-26 2019-04-19 中国地质大学(武汉) A kind of 3D gesture identification method and system based on depth convolutional neural networks

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111079661A (en) * 2019-12-19 2020-04-28 中国科学技术大学 Sign language recognition system
CN111079661B (en) * 2019-12-19 2022-07-15 中国科学技术大学 Sign language recognition system
CN111259892A (en) * 2020-01-19 2020-06-09 福建升腾资讯有限公司 Method, device, equipment and medium for inspecting state of indicator light
CN111259892B (en) * 2020-01-19 2023-07-04 福建升腾资讯有限公司 Inspection method, inspection device, inspection equipment and inspection medium for state of indicator lamp
CN111860330A (en) * 2020-07-21 2020-10-30 陕西工业职业技术学院 Apple leaf disease identification method based on multi-feature fusion and convolutional neural network
CN111860330B (en) * 2020-07-21 2023-08-11 陕西工业职业技术学院 Apple leaf disease identification method based on multi-feature fusion and convolutional neural network
CN114882493A (en) * 2021-01-22 2022-08-09 北京航空航天大学 Three-dimensional hand posture estimation and recognition method based on image sequence
CN113283314A (en) * 2021-05-11 2021-08-20 桂林电子科技大学 Unmanned aerial vehicle night search and rescue method based on YOLOv3 and gesture recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191210