CN110555383A - Gesture recognition method based on convolutional neural network and 3D estimation - Google Patents

Gesture recognition method based on convolutional neural network and 3D estimation

Info

Publication number
CN110555383A
CN110555383A
Authority
CN
China
Prior art keywords
network
estimation
hand
gesture recognition
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910703355.XA
Other languages
Chinese (zh)
Inventor
陈分雄
蒋伟
王晓莉
熊鹏涛
韩荣
叶佳慧
王杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Geosciences
Original Assignee
China University of Geosciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Geosciences filed Critical China University of Geosciences
Priority to CN201910703355.XA priority Critical patent/CN110555383A/en
Publication of CN110555383A publication Critical patent/CN110555383A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24: Classification techniques
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107: Static hand or arm
    • G06V40/113: Recognition of static hand signs
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G06V40/28: Recognition of hand or arm movements, e.g. recognition of deaf sign language

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • General Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a gesture recognition method based on a convolutional neural network and 3D estimation, comprising the following steps: processing the image to be recognized with a SegNet-base network model to extract a hand mask feature map from the image; constructing a supervised deep convolutional network, DetectNet, and locating the hand joint points in the hand mask feature map to obtain a 2D-estimated gesture recognition result; and processing the 2D-estimated result with a PoseNormNet model based on canonical-frame and viewpoint estimation to obtain a 3D-estimated gesture recognition result.

Description

Gesture recognition method based on convolutional neural network and 3D estimation
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a gesture recognition method based on a convolutional neural network and 3D estimation.
Background
Gestures are actions performed by the hands, including movements of the palms, fingers or arms. Gesture recognition, the core of novel human-computer interaction, is of central importance to the whole interactive system. Interactive interfaces based on gesture control are intuitive and easy to operate and can greatly improve user experience; they are widely applied in daily life, in fields including virtual reality, smart homes and medicine. Gesture recognition also has very important practical significance in fields such as assisted driving and sign language translation. Recognizing both simple static hand movements and complex dynamic hand movements is an important interface for human-computer interaction. A static gesture is a single posture state of the hand at one moment, and the correlation between different static gestures is small. A dynamic gesture is a sequence of hand movements over a short, continuous time, i.e., a set of static gestures spread out in time, between which there is strong correlation. A dynamic gesture recognition system increases the computational requirements of the device and the cost of the product, while a stable and efficient static gesture recognition system can accomplish a considerable amount with fewer system resources.
The hand has many joints and high degrees of freedom, which causes severe self-occlusion of hand postures and very high similarity between some actions; an appropriate representation is therefore needed to model the hand structure. For a long time, vision-based gesture recognition research was confined to traditional methods, which often rely on manually selected features, such as skin color and hand texture, to detect gestures.
After depth cameras such as the Kinect appeared, the depth information captured by the sensor brought a new solution to problems that traditional two-dimensional-image-based methods found hard to overcome. The widespread use of depth cameras has also made gesture estimation an increasingly important research focus in computer vision. Gesture estimation extracts the hand pose from image data, depth data and so on in the form of joint points, i.e., it characterizes the whole hand with a small number of key points. This not only reduces the computational burden of the algorithm but also effectively improves the recognizability of the gesture and reduces the influence of regions outside the target.
Traditional gesture recognition algorithms fall mainly into three categories: template-based algorithms, feature-classification-based algorithms and probability-statistics-based algorithms.
Template-based algorithms mainly comprise Template Matching, where, for example, Mahbub recognizes by matching the gesture to be recognized one by one against templates in a pre-trained gesture template library; Dynamic Programming (DP), where, for example, Tew converts a multi-stage decision process into multiple single-stage problems solved one by one, obtaining the global decision result by recursive computation; and Dynamic Time Warping (DTW), where, for example, Hartmann adjusts gesture sequences along the time axis to eliminate speed differences when different people make the same gesture.
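To make the DTW idea concrete, the following minimal Python sketch (not code from the patent; the per-frame feature vectors and the Euclidean frame distance are assumptions) aligns two gesture sequences on the time axis so that speed differences between performers do not dominate the distance:

```python
import numpy as np

def dtw_distance(seq_a: np.ndarray, seq_b: np.ndarray) -> float:
    """Minimal DTW: aligns two gesture feature sequences on the time axis,
    absorbing speed differences between performers of the same gesture."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])  # frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(cost[n, m])

# The same gesture performed at two speeds still aligns closely:
slow = np.sin(np.linspace(0, 3, 60))[:, None]
fast = np.sin(np.linspace(0, 3, 40))[:, None]
print(dtw_distance(slow, fast))
```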
Probability-statistics-based algorithms mainly include the dynamic Bayesian Network (BN), where, for example, Suk fuses timing information on top of a static Bayesian network for dynamic gesture recognition; and Hidden Markov Models (HMMs), where, for example, Lee exploits time-series analysis to achieve good results in gesture recognition.
Commonly used feature-classification-based methods mainly include the Support Vector Machine (SVM), which, for example, Ghaleb combines with high-dimensional mode features for classification in gesture recognition; and AdaBoost (adaptive boosting), where, for example, Ding trains several weak classifiers on a sample set and combines them with certain weights into a strong classifier for gesture recognition.
The traditional machine learning methods above have developed to a certain extent over a long time, but still have the following disadvantages: (1) most feature extraction schemes of these algorithms or models are designed for specific gestures; feature selection depends on the researchers' own experience and carries large uncertainty, the scale of the model parameters is limited by the mechanism of manually setting parameters, and the application scenarios are limited; (2) the number and diversity of samples are not abundant, so the robustness and adaptability of the algorithms are not strong enough.
Deep learning networks have very strong self-learning ability but need huge data volumes and outstanding computing power, so they did not receive industrial attention early on. As convolutional neural networks achieved breakthroughs in computer vision and GPU technology was introduced to accelerate computation, research on deep learning exploded in various fields. More and more researchers apply deep convolutional network models to gesture recognition systems. For example, Garcia proposes a method for recognizing American Sign Language letters in real time, based on transfer learning with an ASL dataset on a pre-trained GoogLeNet model, which can reliably recognize a-e and roughly recognize a-k. Neverova integrates multi-modal data to construct a multi-scale convolutional neural network that learns feature representations over multiple temporal and spatial dimensions. Molchanov uses a 3D CNN to extract the driver's gestures from depth images while avoiding overfitting through data augmentation during training. Wu changes the network structure, employing a dual-stream convolutional network over time and space for verification. Pigou presents an end-to-end model based on temporal convolutions and a bidirectional recurrent neural network, improving video-based gesture recognition speed. These methods achieve better performance in some respects.
The above methods generally use prior knowledge to extract features from the structural characteristics of the image itself. However, owing to the complexity of the hand joints and the high degrees of freedom of the joint points, it is very difficult to estimate the posture accurately from visual cues alone. Many researchers therefore adopt the pose-model construction methods of human pose estimation and analyze hand posture from the perspective of hand joint points; joint point estimation is thus also commonly called pose estimation, and hand pose estimation is discussed here. Gesture estimation finds the hand joint points and the connections between them in the image of interest. In recognition methods based on gesture estimation, the hand joint points are detected first, and the spatial distribution of these nodes is then processed to characterize the hand gesture. Gesture estimation can also be classed among vision-based recognition algorithms.
Early gesture estimation was based on 2D joint points and targeted specific, well-differentiated databases. 2D joint point estimation works well for simple, clear hand gestures but cannot cope with the finger-confusion problem caused by self-occlusion and the similar color of the fingers. Before 2010, the only input for gesture estimation was a color image data stream, which increased its difficulty. Researchers predicted the three-dimensional information of the joint points from the detected position and contour of the human hand. Ballan et al. use an explicit model under a multi-view camera system to infer gestures by matching against a predefined gesture database. Sharp tracks from an initial pose to achieve pose estimation.
The 3D representation of joint points carries more spatial information than 2D joint points, which can effectively alleviate the above problems. With the advent of low-cost depth cameras, research has focused mainly on RGB-D-based gesture estimation. Tompson uses a CNN conditioned on a multi-resolution image pyramid to detect hand joint points in two dimensions, recovering the 3D pose by solving an inverse kinematics optimization problem. Zhou explores correlating joint point coordinates in the bottleneck of an encoding compression; Oberweger trains a CNN to directly regress the three-dimensional coordinates from a given hand-cropped depth map; Zhou also estimates the angles between the bones of the kinematic chain instead of Cartesian coordinates. Oberweger further utilizes a CNN that can synthesize a depth map from a given pose, which allows the initial pose estimate to be improved continuously by minimizing the distance between the observed and synthesized depth images. Pioneering works consider using RGB images for gesture estimation and propose a multi-view bootstrapping method to train a hand joint detector on single RGB images. Most current 3D gesture estimation methods rely on active sensors such as depth cameras, which are easily interfered with by other active components; moreover, depth cameras are not as common as ordinary color cameras, work reliably only indoors, and are unsuitable for outdoor environments and mobile devices.
By comparison, RGB data is simple to acquire, the equipment is portable, and the application scenarios are wider. Full 3D gesture estimation from RGB data alone, although challenging, has begun to attract attention. Gesture estimation and deep learning are both dominant vision-based approaches, and combining them efficiently for gesture recognition is both challenging and novel.
Disclosure of Invention
Aiming at the technical problems that current 3D gesture estimation methods are imperfect and ill-suited to outdoor environments and mobile devices, the present invention provides a gesture recognition method based on a convolutional neural network and 3D estimation to overcome the above technical defects.
A gesture recognition method based on a convolutional neural network and 3D estimation comprises the following steps:
Step one: processing the image to be recognized with a SegNet-base network model to extract a hand mask feature map from the image;
Step two: constructing a supervised deep convolutional network, DetectNet, and locating the hand joint points in the hand mask feature map to obtain a 2D-estimated gesture recognition result;
Step three: processing the 2D-estimated result with a PoseNormNet model based on canonical-frame and viewpoint estimation to obtain a 3D-estimated gesture recognition result.
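For illustration only, the three steps can be pictured as the following Python (PyTorch) cascade; the model interfaces, the tensor shapes and the crop_to_hand helper are assumptions made for exposition, not the patent's implementation:

```python
import torch
import torch.nn.functional as F

def crop_to_hand(image: torch.Tensor, mask: torch.Tensor, size: int = 256) -> torch.Tensor:
    # Hypothetical helper: crop the bounding box of the predicted hand pixels
    # and resize the crop back to the network input size.
    ys, xs = torch.nonzero(mask, as_tuple=True)
    if ys.numel() == 0:                            # no hand found: keep the full image
        return image
    crop = image[:, :, ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    return F.interpolate(crop, size=(size, size), mode='bilinear', align_corners=False)

def recognize_gesture(image, segnet, detectnet, posenormnet, classifier):
    # image: (1, 3, 256, 256) RGB tensor; the four models are the trained stages.
    mask = segnet(image).argmax(dim=1)[0]          # step 1: per-pixel hand/background
    crop = crop_to_hand(image, mask)               # keep only the most salient hand
    score_maps = detectnet(crop)                   # step 2: (1, 21, 32, 32) joint score maps
    pose3d = posenormnet(score_maps)               # step 3: relative 3D pose, (1, 21, 3)
    return classifier(pose3d).argmax(dim=1)        # gesture class index
```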
Further, in step one, the SegNet-base network model first divides the targets in the image into two categories, hand and background, detected with a convolutional neural network; the network layer consists of several convolution blocks and a fully convolutional layer, each convolution block is built from consecutive convolutions, and the convolutions use 'SAME' padding.
Further, in step two, a convolutional neural network (CNN) model is used to build the hand joint point detection network DetectNet, which is combined with the SegNet-base network into the 2D hand joint point detection model PoseNet2D; the hand mask feature map is processed with PoseNet2D to obtain the 2D-estimated gesture recognition result.
Further, in step three, the input dimensions of the PoseNormNet model are set according to the 2D hand joint point detection model PoseNet2D.
Compared with the prior art, the invention has the following advantages:
On the basis of an existing network model, cross-layer connections and deconvolution are added to obtain an improved network model. The network outputs the maximum hand segmentation mask to locate the most salient hand in the image. For three-dimensional node estimation, a 3D gesture estimation network based on canonical-frame and viewpoint estimation is proposed, through which 3D hand pose distributions are inferred from the 2D key nodes. Finally, the 3D hand pose distribution output by the third network is used as the feature input of a classifier, and a softmax-based fully connected network is constructed for gesture recognition, effectively improving the reliability and accuracy of gesture recognition.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a flow chart of a gesture recognition method based on convolutional neural network and 3D estimation according to the present invention;
FIG. 2 is a schematic diagram of the SegNet-base segmentation network architecture of the present invention;
FIG. 3 is a diagram of the 2D gesture estimation network architecture of the present invention;
FIG. 4 is a hand joint point skeleton model diagram of the present invention;
FIG. 5 is a PoseNormNet model architecture diagram of the present invention;
FIG. 6 is a PCK curve diagram of DetectNet of the present invention under error thresholds of 0-30 pixels;
FIG. 7 is a PCK comparison between the DetectNet model and the PoseNet2D model of the present invention;
FIG. 8 is a graph of the detection effect of PoseNet2D of the present invention on an RHD data set;
FIG. 9 is a diagram of the PCK of the PoseNormNet of the present invention compared to other models in the RHD dataset.
Detailed Description
For a clearer understanding of the technical features, objects and effects of the present invention, embodiments of the present invention are now described in detail with reference to the accompanying drawings.
The SegNet-base hand segmentation network aims to extract the hand and focus subsequent operations on it. It comprises two steps. In the first step, the targets in the image are divided into two categories, hand and background, detected with a convolutional neural network. In the second step, the feature map generated in the first step is upsampled to the original image size, and a softmax classifier is applied and compared against the ground-truth labels to constrain the actual output. As shown in FIG. 2, the adopted hand segmentation network is an end-to-end trainable network model.
The SegNet-base network model turns the hand localization problem into a segmentation problem to extract the hand mask feature map. The network layer consists of several convolution blocks and a fully convolutional layer; each convolution block is built from consecutive convolutions, and the convolutions use 'SAME' padding. The network input is set to 256 × 256 × 3, completing the first step of the segmentation network design. The second step uses ordinary bilinear-interpolation upsampling: SegNet-base directly upsamples the feature map extracted by the last convolution block by a factor of 8, restoring it from 32 × 32 to 256 × 256, and the network outputs the hand mask corresponding to the maximum score.
TABLE 1 Hand segmentation network architecture
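A minimal PyTorch sketch of a network with this shape is given below; the block depths and channel widths are illustrative assumptions (the parameters of Table 1 are not reproduced here), but it shows the 'SAME'-padded convolution blocks, the fully convolutional scoring layer, and the single 8x bilinear upsampling from 32 × 32 back to 256 × 256:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegNetBaseSketch(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        def block(cin, cout, n):
            # a convolution block: n consecutive 3x3 'SAME' convolutions, then pooling
            layers = []
            for i in range(n):
                layers += [nn.Conv2d(cin if i == 0 else cout, cout, 3, padding=1),
                           nn.ReLU(inplace=True)]
            layers.append(nn.MaxPool2d(2))
            return nn.Sequential(*layers)
        # three blocks reduce 256x256 inputs to 32x32 feature maps (factor 8)
        self.features = nn.Sequential(block(3, 64, 2), block(64, 128, 2), block(128, 256, 3))
        self.head = nn.Conv2d(256, num_classes, 1)   # fully convolutional scoring layer

    def forward(self, x):                            # x: (B, 3, 256, 256)
        scores = self.head(self.features(x))         # (B, 2, 32, 32)
        return F.interpolate(scores, scale_factor=8, mode='bilinear',
                             align_corners=False)    # (B, 2, 256, 256)

# The hand mask is the class with the maximum score at each pixel:
# mask = SegNetBaseSketch()(img).argmax(dim=1)
```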
Combining reasonable prior knowledge from the original color image data, the distribution of hand joint points is learned to indirectly describe the target position, and a 2D gesture estimation framework is constructed, as shown in FIG. 3.
The hand posture model adopted is an improvement on the linear blend skinning (LBS) model, a general term for models that express the hand through degrees of freedom (DOF) at the joints and the bones connecting them, characterizing the whole hand posture with a small number of key points. The improvement is to represent the joints by their spatial coordinates instead of degrees of freedom, giving the hand gesture node representation model shown in FIG. 4. It contains 21 joint points: one for the palm (or wrist) and 4 for each finger. Given a one-handed color image with N rows, M columns and 3 channels, a coordinate pair w_i = (x_i, y_i) describes the position of the i-th joint point in two-dimensional space, with i ∈ [1, J] and J = 21 herein. During labeling, numbers denote the different joint points: the palm (or wrist) is number 0; the TIP, DIP, PIP and MCP of the thumb (T) are numbered 1-4 in sequence; the index finger (I) corresponds to numbers 5-8, the middle finger (M) to 9-12, the ring finger (R) to 13-16, and the little finger (P) to 17-20. The right hand is labeled analogously to the left hand.
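The numbering scheme can be written down directly; the following Python constants (an illustrative encoding, not part of the patent) capture the 21 joint indices and the skeleton edges implied by FIG. 4:

```python
# Joint numbering as described above: palm/wrist = 0; per finger the
# TIP, DIP, PIP, MCP points in sequence.
PALM = 0
FINGERS = {
    "thumb":  (1, 2, 3, 4),      # TIP, DIP, PIP, MCP
    "index":  (5, 6, 7, 8),
    "middle": (9, 10, 11, 12),
    "ring":   (13, 14, 15, 16),
    "little": (17, 18, 19, 20),
}

# Skeleton edges: palm -> MCP, then MCP -> PIP -> DIP -> TIP along each finger.
SKELETON = []
for tip, dip, pip_joint, mcp in FINGERS.values():
    SKELETON += [(PALM, mcp), (mcp, pip_joint), (pip_joint, dip), (dip, tip)]

assert len(SKELETON) == 20 and len({PALM, *sum(FINGERS.values(), ())}) == 21
```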
Locating the hand joint points in the two-dimensional image builds on the segmentation network. The main scheme consists of two cascaded deep network models: the first network performs hand detection and segmentation to locate the hand; based on the segmentation map output by this preceding network, the hand structure is cropped from the original image; the second network then locates the hand joint points in the two-dimensional image. Following an encoder-decoder pose framework, two-dimensional joint localization is treated as the estimation of a two-dimensional score map, and a supervised deep convolutional network, denoted DetectNet, is designed. The DetectNet layer design is shown in Table 2, ordered from network input (top) to output (bottom).
TABLE 2 DetectNet network layer design
Table 3 lists each convolution block, where the number of layers is the number of consecutive convolutions in the block, and the number of input channels of each convolution kernel is determined by the input of the current layer. To compensate for the feature loss caused by the large number of network layers and the depth of refinement, cross-layer connections are used in the middle layers. fc1, fc2 and fc3 denote fully convolutional blocks; the features at this stage are fused by 1 × 1 convolutions.
TABLE 3 Parameters of each convolution block

Convolution block | Number of layers | Number of output channels
conv1 | 2 | 64
conv2 | 2 | 128
conv3 | 4 | 256
conv4 | 5 | 256
conv5 | 6 | 128
conv6 | 6 | 128
The 3 conv + pool modules preliminarily extract the distribution characteristics of the hand joints. The remainder of the DetectNet network is divided into three stages; each stage ends in feature fusion, but the stages differ. In the first stage, conv4 and fc1 after pool3 fuse the preceding features, yielding the first rough prediction score map fc1_2. The input to the second stage consists of conv4_5 and fc1_2; the input to the third stage consists of conv4_5 and fc2_2; these two stages yield the prediction score maps fc2_2 and fc3_2, respectively. The concat(D1, D2) operation joins feature layer D1 and feature layer D2 along the corresponding dimension, combining features from different stages. All three prediction score maps have size 32 × 32 × 21, each feature map representing the position distribution of one of the 21 joint points. Each later score map localizes the joint points more accurately than the preceding convolution, so the position information in fc3_2 is the most refined and serves as the basis of the network output.
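A compressed PyTorch sketch of this staged refinement follows; the backbone stands in for the conv + pool modules and the widths are assumptions, but it reproduces the pattern described above: each stage concatenates earlier features with the previous score map and emits a 32 × 32 × 21 prediction, the last playing the role of fc3_2:

```python
import torch
import torch.nn as nn

class DetectNetSketch(nn.Module):
    def __init__(self, j=21, c=128):
        super().__init__()
        self.backbone = nn.Sequential(   # stands in for the conv + pool blocks
            nn.Conv2d(3, c, 3, stride=2, padding=1), nn.ReLU(True),
            nn.Conv2d(c, c, 3, stride=2, padding=1), nn.ReLU(True),
            nn.Conv2d(c, c, 3, stride=2, padding=1), nn.ReLU(True))
        self.stage1 = nn.Conv2d(c, j, 1)        # rough score map, like fc1_2
        self.stage2 = nn.Conv2d(c + j, j, 1)    # refines the concatenated input
        self.stage3 = nn.Conv2d(c + j, j, 1)    # most refined map, like fc3_2

    def forward(self, x):                # x: (B, 3, 256, 256)
        f = self.backbone(x)             # (B, c, 32, 32)
        s1 = self.stage1(f)
        s2 = self.stage2(torch.cat([f, s1], dim=1))  # concat(D1, D2) fusion
        s3 = self.stage3(torch.cat([f, s2], dim=1))
        return s1, s2, s3                # each (B, 21, 32, 32)
```

Intermediate score maps s1 and s2 can be supervised as well, which is the usual motivation for multi-stage designs of this kind.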
The PoseNormNet network structure is shown in FIG. 5. It is a network based on canonical-frame and viewpoint estimation and comprises two structurally similar parts: the upper part learns the 3D coordinates w_c in the canonical frame, and the lower part learns a rotation matrix R between the initial frame and the canonical frame; finally the normalized relative coordinates w_rel are obtained. Its characteristics are: the upper (and lower) part networks expand the number of convolution layers, replace pooling layers with strided convolutions, and add left-hand/right-hand information. To match the output of the 2D estimation from PoseNet2D, the input dimensions are set to 32 × 32 × 21, so the input to the PoseNormNet model is actually the joint score maps rather than two-dimensional coordinates.
The specific scale of the upper network's layers is shown in Table 4; only the per-layer parameters of the lower network differ from the upper network, and the bracketed values in the last column give the channel dimensions of the lower network's features. In the table, s = 2 means the convolution stride of that layer is 2, which corresponds to pooling the features. A ReLU activation function and a dropout operation with rate 0.2 are used in the fully connected layer FC. When constructing the upper network, D = 63, for the three-dimensional coordinates of the 21 nodes; when constructing the lower network, D = 3, for the three axis-angle parameters of the rotation matrix.
TABLE 4 Specific scale of the PoseNormNet upper network
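The two-stream idea can be sketched in PyTorch as follows; the tiny encoder, the axis-angle-to-rotation conversion via Rodrigues' formula, and the composition of w_c with R are illustrative assumptions consistent with the description (D = 63 for the canonical coordinates, D = 3 for the viewpoint; strided convolutions instead of pooling; dropout 0.2 in the FC layers):

```python
import torch
import torch.nn as nn

def axis_angle_to_matrix(v: torch.Tensor) -> torch.Tensor:
    """Rodrigues' formula: (B, 3) axis-angle parameters -> (B, 3, 3) rotations."""
    theta = v.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    k = v / theta                                   # unit rotation axis
    K = torch.zeros(v.shape[0], 3, 3, device=v.device)
    K[:, 0, 1], K[:, 0, 2] = -k[:, 2], k[:, 1]
    K[:, 1, 0], K[:, 1, 2] = k[:, 2], -k[:, 0]
    K[:, 2, 0], K[:, 2, 1] = -k[:, 1], k[:, 0]
    I = torch.eye(3, device=v.device).expand_as(K)
    s, c = torch.sin(theta)[..., None], torch.cos(theta)[..., None]
    return I + s * K + (1 - c) * (K @ K)

class PoseNormNetSketch(nn.Module):
    def __init__(self, j=21):
        super().__init__()
        def stream(d_out):
            return nn.Sequential(
                nn.Conv2d(j, 32, 3, stride=2, padding=1), nn.ReLU(True),  # s=2 replaces pooling
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(True),
                nn.Flatten(),
                nn.Linear(64 * 8 * 8, 256), nn.ReLU(True), nn.Dropout(0.2),
                nn.Linear(256, d_out))
        self.canonical = stream(j * 3)   # upper stream: w_c, D = 63
        self.viewpoint = stream(3)       # lower stream: axis-angle of R, D = 3

    def forward(self, score_maps):       # score_maps: (B, 21, 32, 32)
        b = score_maps.shape[0]
        w_c = self.canonical(score_maps).view(b, -1, 3)
        R = axis_angle_to_matrix(self.viewpoint(score_maps))
        w_rel = w_c @ R.transpose(1, 2)  # rotate the canonical pose into the view
        return w_rel, w_c, R
```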
Example 1: reconstruction result of test sample after CBN-VAE network training
The dataset used in this example is a virtual dataset (RHD) synthesized from three-dimensional animated character models and natural backgrounds. 16 characters are randomly selected, and the samples of 31 actions performed by them form the training dataset, containing 41258 images of size 320 × 320 × 3, denoted RHD_train. The remaining 4 characters and the samples of the other 8 actions performed by them form the test dataset, containing 2728 images of size 320 × 320 × 3, denoted RHD_test.
The SegNet-base and SegNet-prop networks were trained and their performance evaluated on the RHD_test set, with no operation other than resizing applied to the test data; the evaluation results are shown in Table 5.
TABLE 5 Performance comparison on the RHD dataset
The evaluation results show that the SegNet-prop model achieves hand segmentation more effectively than SegNet-base with its simple bilinear-interpolation upsampling.
Example 2: training was performed on the RHD _ train dataset and STB _ train dataset, respectively, and then evaluation was performed on the RHD _ test dataset and STB _ test dataset, respectively, with the results of the evaluation shown in table 6 and fig. 6. The EPE difference between the two is shown in Table 6, the endpoint error in the table is pixel unit, the AUC of the error threshold value range from 0 to 30 is calculated, the arrow points represent the variation of the performance level along with the value size, and the upward represents that the performance is better when the value is larger.
TABLE 6 DetectNet performance evaluation

Test set | EPE mean (px) ↓ | EPE median (px) ↓ | AUC (0~30) ↑
RHD_test | 11.619 | 9.427 | 0.814
STB_test | 8.823 | 8.164 | 0.917
The PCK curves of the DetectNet network under different endpoint error thresholds show that the PCK value grows quickly while the error threshold is below 15 pixels; above 15 pixels, PCK grows slowly. A smaller error threshold places higher demands on the network output, so setting the error threshold between 10 and 15 yields the best-performing model at this stage. The EPE mean and EPE median results likewise prove the feasibility of DetectNet for 2D joint point detection, with good performance.
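For reference, the EPE and PCK quantities used in these evaluations can be computed as in the following sketch (standard definitions, assumed to match the tables; thresholds in pixels over the 0~30 range):

```python
import numpy as np

def epe(pred: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """End-point error in pixels between predicted and ground-truth 2D joints;
    inputs are (N, 21, 2) arrays. Returns (mean, median)."""
    err = np.linalg.norm(pred - gt, axis=-1)
    return float(err.mean()), float(np.median(err))

def pck_curve(pred, gt, thresholds=np.arange(0, 30.5, 0.5)):
    """PCK: fraction of joints whose error falls under each threshold;
    AUC is the normalized area under this curve over 0-30 px."""
    err = np.linalg.norm(pred - gt, axis=-1).ravel()
    pck = np.array([(err <= t).mean() for t in thresholds])
    auc = np.trapz(pck, thresholds) / (thresholds[-1] - thresholds[0])
    return pck, float(auc)
```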
example 3: evaluation of a complete 2D hand joint point detection network
The complete hand joint point detection network, denoted PoseNet2D herein, is formed by cascading the hand segmentation network SegNet-prop and the joint detection network DetectNet. The segmentation network is first trained on RHD_train only, i.e., the conclusion of the previous example is reused. A better DetectNet model is then obtained on the Joint_train training set, and finally the PoseNet2D network is tested on the RHD_test, STB_test and Dexter datasets. As shown in FIG. 7, the solid curves are the PCK of the DetectNet model on the RHD_test and STB_test sets, and the dashed curves are the PCK of the PoseNet2D model on the RHD_test, STB_test and Dexter test sets.
The example results show that the PoseNet2D model adapts well to the RHD and STB datasets and performs well, providing effective input for the subsequent network.
Example 4: representative results of joint detection on RHD data sets using the PoseNet2D model are shown in fig. 8, where the first row shows hand joint bone connection and the bottom row shows hand joint point distribution.
In this example, the first two columns show hand self-occlusion and the last two columns show simple hand actions; the hand joints are detected in all cases. Note that the sample in the second column contains two hands; since the SegNet-prop segmentation network outputs only the hand with the larger prediction score, the input to DetectNet is a cropped map containing only one hand, which is consistent with the desired output.
Example 5: evaluation of PoseNormNet networks
In the embodiment of this section, the initial learning rate is set to be 40000 maximum iterations, and the learning rate decreases by 10 times every 10000 iterations. All weights are initialized randomly. To avoid overfitting of the network, the present embodiment takes two strategies: (1) the randomness of the truth score map is enhanced by using Gaussian noise with the variance of 1.5 pixels to interfere with the key point positions; (2) and carrying out random scaling and overturning on the truth score map to further enhance the randomness. In this embodiment, a PoseNormNet model is learned in the RHD _ train data set, and the RHD _ test set is evaluated and compared with a Direct model, a Bottleneck model and a Local model, and the evaluation results are shown in Table 7 and FIG. 9.
TABLE 7 Performance evaluation of each model
Table 7 shows that both the EPE mean and the EPE median of the improved PoseNormNet model decrease on RHD; the examples show that the improved model, based on relative-normalized canonical-frame processing and viewpoint estimation, achieves better performance than the Direct, Local and Bottleneck models.
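A sketch of the two anti-overfitting strategies from this example follows; the Gaussian heatmap rendering and the scaling range are assumptions, while the noise variance of 1.5 pixels comes from the description above:

```python
import numpy as np

def render_score_maps(keypoints, size=32, sigma=1.0):
    """Render one Gaussian 'truth' score map per joint; returns (21, size, size)."""
    ys, xs = np.mgrid[0:size, 0:size]
    maps = [np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
            for x, y in keypoints]
    return np.stack(maps)

def augment(keypoints, rng, size=32):
    # (1) jitter keypoints with Gaussian noise of variance 1.5 px before rendering
    kp = keypoints + rng.normal(0.0, np.sqrt(1.5), keypoints.shape)
    # (2) random scaling about the map center (range is an assumption) and flipping
    kp = (kp - size / 2) * rng.uniform(0.9, 1.1) + size / 2
    maps = render_score_maps(kp, size)
    if rng.random() < 0.5:                                   # random horizontal flip
        maps = maps[:, :, ::-1]
        kp = np.column_stack([size - 1 - kp[:, 0], kp[:, 1]])
    return maps.copy(), kp

# Usage: maps, kp = augment(np.random.default_rng(0).uniform(4, 28, (21, 2)),
#                           np.random.default_rng(1))
```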
In summary, the gesture recognition method based on the convolutional neural network and the 3D estimation of the present invention includes:
(1) Hand segmentation: cross-layer connections and deconvolution are added to an existing network model to obtain an improved network model. The network outputs the maximum hand segmentation mask to locate the most salient hand in the image.
(2) Hand joint point detection: a network detects the hand joint points in the cropped image and outputs a distribution score map for each node; the score maps record the position distribution of the nodes on the image plane.
(3) Three-dimensional node estimation: a 3D gesture estimation network based on canonical-frame and viewpoint estimation is proposed, through which 3D hand pose distributions are inferred from the 2D joint points.
Finally, the 3D hand pose distribution output by the third network is used as the feature input of a classifier, and a softmax-based fully connected network is constructed for gesture recognition. An average recognition rate of 94% is obtained on a custom gesture database constructed from the STB dataset, and an average recognition rate of 69.3% on the RETH database, nearly 7% higher in accuracy than the original work, proving the feasibility of the RGB-based 3D gesture estimation method. The invention effectively improves the reliability and accuracy of gesture recognition.
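As a hedged sketch of this final stage, a softmax-based fully connected classifier over the 21 × 3 relative coordinates might look like the following; the hidden width and the use of cross-entropy (log-softmax plus negative log-likelihood) are assumptions:

```python
import torch
import torch.nn as nn

class GestureClassifierSketch(nn.Module):
    """Fully connected softmax classifier over the 3D hand pose features
    (21 x 3 relative joint coordinates); layer sizes are illustrative."""
    def __init__(self, num_classes: int, j: int = 21):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(j * 3, 128), nn.ReLU(True),
            nn.Linear(128, num_classes))   # logits; softmax is applied in the loss

    def forward(self, pose3d):             # pose3d: (B, 21, 3)
        return self.net(pose3d)

# Training would typically use nn.CrossEntropyLoss over these logits.
```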
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (4)

1. A gesture recognition method based on a convolutional neural network and 3D estimation, characterized by comprising the following steps:
Step one: processing the image to be recognized with a SegNet-base network model to extract a hand mask feature map from the image to be recognized;
Step two: constructing a supervised deep convolutional network DetectNet, and locating the hand joint points in the hand mask feature map to obtain a 2D-estimated gesture recognition result;
Step three: processing the 2D-estimated gesture recognition result with a PoseNormNet model based on canonical-frame and viewpoint estimation to obtain a 3D-estimated gesture recognition result.
2. The gesture recognition method based on a convolutional neural network and 3D estimation according to claim 1, characterized in that in step one the SegNet-base network model first divides the targets in the image into two categories, hand and background, detected with a convolutional neural network; the network layer consists of several convolution blocks and a fully convolutional layer, each convolution block is built from consecutive convolutions, and the convolutions use 'SAME' padding.
3. The gesture recognition method based on a convolutional neural network and 3D estimation according to claim 1, characterized in that in step two a convolutional neural network (CNN) model is used to build the hand joint point detection network DetectNet, which is combined with the SegNet-base network into the 2D hand joint point detection model PoseNet2D, and the hand mask feature map is processed with PoseNet2D to obtain the 2D-estimated gesture recognition result.
4. The gesture recognition method based on a convolutional neural network and 3D estimation according to claim 1, characterized in that in step three the input dimensions of the PoseNormNet model are set according to the 2D hand joint point detection model PoseNet2D.
CN201910703355.XA 2019-07-31 2019-07-31 Gesture recognition method based on convolutional neural network and 3D estimation Pending CN110555383A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910703355.XA CN110555383A (en) 2019-07-31 2019-07-31 Gesture recognition method based on convolutional neural network and 3D estimation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910703355.XA CN110555383A (en) 2019-07-31 2019-07-31 Gesture recognition method based on convolutional neural network and 3D estimation

Publications (1)

Publication Number Publication Date
CN110555383A 2019-12-10

Family

ID=68737194

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910703355.XA Pending CN110555383A (en) 2019-07-31 2019-07-31 Gesture recognition method based on convolutional neural network and 3D estimation

Country Status (1)

Country Link
CN (1) CN110555383A (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657634A (en) * 2018-12-26 2019-04-19 中国地质大学(武汉) A kind of 3D gesture identification method and system based on depth convolutional neural networks

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111079661A (en) * 2019-12-19 2020-04-28 中国科学技术大学 Sign language recognition system
CN111079661B (en) * 2019-12-19 2022-07-15 中国科学技术大学 Sign language recognition system
CN111259892A (en) * 2020-01-19 2020-06-09 福建升腾资讯有限公司 Method, device, equipment and medium for inspecting state of indicator light
CN111259892B (en) * 2020-01-19 2023-07-04 福建升腾资讯有限公司 Inspection method, inspection device, inspection equipment and inspection medium for state of indicator lamp
CN111860330A (en) * 2020-07-21 2020-10-30 陕西工业职业技术学院 Apple leaf disease identification method based on multi-feature fusion and convolutional neural network
CN111860330B (en) * 2020-07-21 2023-08-11 陕西工业职业技术学院 Apple leaf disease identification method based on multi-feature fusion and convolutional neural network
CN114882493A (en) * 2021-01-22 2022-08-09 北京航空航天大学 Three-dimensional hand posture estimation and recognition method based on image sequence
CN113283314A (en) * 2021-05-11 2021-08-20 桂林电子科技大学 Unmanned aerial vehicle night search and rescue method based on YOLOv3 and gesture recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191210