CN112183198A - Gesture recognition method for fusing body skeleton and head and hand part profiles - Google Patents

Gesture recognition method for fusing body skeleton and head and hand part profiles

Info

Publication number
CN112183198A
Authority
CN
China
Prior art keywords
skeleton
gesture
human body
human
network
Prior art date
Legal status
Pending
Application number
CN202010851927.1A
Other languages
Chinese (zh)
Inventor
何坚
廖俊杰
张丞
余立
Current Assignee
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN202010851927.1A
Publication of CN112183198A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06V 40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/107 Static hand or arm
    • G06V 40/113 Recognition of static hand signs
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent


Abstract

A gesture recognition method that fuses the body skeleton with the contours of the head and hand parts, belonging to the field of computer vision. The invention describes human gestures by combining human skeleton posture features with gesture part contour features, and uses the part categories identified by a contour detection network to represent local information of the human body, making the human body model more complete. The invention prunes a CPM network to construct a human skeleton key node recognition network KEN with sufficient real-time performance: in practical tests it reaches a recognition speed of 15 frames per second while maintaining high recognition accuracy. The method is characterized by a small number of parameters, fast operation and high recognition accuracy.

Description

Gesture recognition method for fusing body skeleton and head and hand part profiles
Technical Field
The invention relates to a gesture recognition method that fuses the limb skeleton with the contours of the head and hand parts. It belongs to the field of electronic information and is a computer-vision-based human gesture recognition method applicable to human-computer interaction.
Background
Gestures are the most important form of non-verbal communication between people. Because gestures are natural and take many forms, gesture recognition is an important area of human-computer interaction research. Recognition methods can be divided into contact-based gesture recognition and vision-based gesture recognition according to whether the sensing device touches the body. The devices used for contact-based recognition (such as data gloves) are complex and expensive, and users must become familiar with them before their gestures can be recognized, which limits the natural expression of gestures and hinders natural interaction. Vision-based gesture recognition requires no expensive equipment, is convenient and natural, better fits the trend toward natural human-computer interaction, and has broad application prospects. Computer-vision-based methods are easy to implement, but their recognition accuracy is easily affected by background, illumination and changes in gesture motion. In recent years, deep learning algorithms have achieved excellent results in image recognition, natural language processing and other fields, providing a new way to implement human gesture recognition.
To address the problems of computer-vision-based human gesture recognition, the invention introduces the deep-learning-based Convolutional Pose Machine (CPM), Single Shot MultiBox Detector (SSD) and Long Short-Term Memory (LSTM) for human gesture recognition.
Disclosure of Invention
To address the problem that computer-vision-based human gesture recognition is easily affected by illumination, background and dynamic gesture changes, the invention combines CPM, SSD and LSTM to construct a human dynamic gesture recognizer (Gesture Recognizer based on Spatial Context and Temporal Feature Fusion, GRSCTFF) that extracts the spatio-temporal features of human gestures and thereby recognizes them quickly and accurately.
the invention specifically comprises the following steps:
(1) On the basis of analyzing the spatial context features of human gestures, establishing a dynamic gesture model based on human skeleton and part contour features;
When gesture interaction is used, the form of a dynamic gesture consists mainly of the human skeleton configuration and the contours of the hands and head; in essence, a gesture is the combination of the relative positions of the skeleton joint points, the individual skeleton segments (e.g. their lengths and angles) and the external shapes of the hands and head. The human skeleton is formed by linking skeleton key nodes to one another. In the invention, "human skeleton key nodes" are the key nodes contained in the human skeleton link structure, and "parts" are the hands, head and feet, which have shape and contour characteristics. Drawing on the idea of 3-dimensional human body models, a universal gesture model is established that fuses the contour features of the human skeleton, hands, head and other parts.
(2) Constructing a deep neural network using the convolutional pose machine and single-shot multibox detector techniques to extract the human gesture skeleton and part contour features, and combining them into the human spatial context features;
Gesture spatial context information consists of the gesture skeleton configuration and the gesture part contours. The gesture skeleton configuration comprises the relative length features of the human skeleton and its angle features relative to the direction of gravitational acceleration. To extract these features, a human skeleton key node extraction network must be established; drawing on the CPM idea and pruning the CPM depth, the invention constructs a human skeleton key node extraction network KEN comprising 3 stages:
setting Z as the set of all position coordinates (i, j) of the human skeleton in the image; using Y as the position of each key node of human skeleton in imagekIndicating that the human skeleton contains a total of 14 key nodes, hence Yk∈{Y1,…,Y14}. KEN is composed of a series of multi-class predictors gtCompositions trained to predict the location of each key node in the same image under different receptive fields. Specifically, gt(. cndot.) is a classifier, and the subscript T ∈ {1, …, T } indicates the stages of classification, each stage having a different receptive field, where T is the last stage of the classifier. gt(. DEG) predicting that the point Z (Z epsilon Z) in the image under the receptive field belongs to a key nodeYkConfidence of (1), using b (Y)kZ) (Z ∈ Z) represents a confidence value, then bT(YkZ) represents the key node confidence for the z-coordinate point currently in the T-phase. These gt(. cndot.) has the same objective function value (i.e., true confidence). When t is>1 time, gt(. is a feature value x extracted from an image position z)zAnd each key node YkAnd (4) splicing functions of the predicted values of the confidence degrees at the moment t-1. After T stages, the position with the highest confidence coefficient is the position of the key node, and argmax represents the key node YkAnd when the confidence coefficient is maximum, acquiring a function of the coordinate point z. Namely:
Yk=argmax(bT(Yk=z)),k∈{1…14} (1)
based on the formula (1), the position of each key node in the human skeleton can be calculated, and a preliminary human skeleton form is established.
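For illustration only, a minimal NumPy sketch of equation (1) could look as follows; the array name, shape and the assumption that the final stage outputs one confidence map per key node are illustrative and not part of the claimed method:

```python
import numpy as np

def extract_key_nodes(confidence_maps: np.ndarray) -> np.ndarray:
    """Apply equation (1): for each of the 14 key nodes, pick the image
    coordinate z with the highest final-stage confidence b_T(Y_k = z).

    confidence_maps: array of shape (14, H, W) holding b_T for every node.
    Returns an array of shape (14, 2) with (row, col) coordinates.
    """
    num_nodes, h, w = confidence_maps.shape
    coords = np.zeros((num_nodes, 2), dtype=np.int64)
    for k in range(num_nodes):
        flat_idx = np.argmax(confidence_maps[k])        # argmax over all positions z
        coords[k] = np.unravel_index(flat_idx, (h, w))  # back to (i, j)
    return coords
```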
Fig. 1 shows the gesture skeleton feature extraction process. φ_1(·) denotes the function that transforms the human skeleton key nodes into limb vectors. In addition, since the height of the human head is fixed, it does not change with body rotation or with the moving distance of the camera. The invention therefore takes the head height as a reference point and introduces a function φ_2(·) denoting the vector concatenation of the relative visible lengths of the skeleton segments contained in the human skeleton, i.e. equation (2):

V_l = φ_2(V) = (|v_1| / |V_head|) ⊕ (|v_2| / |V_head|) ⊕ … ⊕ (|v_11| / |V_head|)    (2)

where 11 indicates that the human skeleton model of the invention contains 11 limb skeleton segments in total, v_i is the i-th skeleton segment of the human gesture, V_head is the head skeleton vector from the vertex to the centre of the neck, | · | denotes the vector modulus (i.e. the length of the head skeleton), and ⊕ denotes vector concatenation. The formula takes V_head as the reference and divides the length of each limb skeleton segment by the modulus of V_head to obtain the visible length of each segment relative to the head skeleton.
In addition, because the direction of gravitational acceleration is always perpendicular to the ground, in order to describe the direction of each skeleton segment relative to the ground the invention introduces the angle between each segment and the gravitational acceleration, and uses φ_3(·) to denote the vector concatenation of the angles between each skeleton segment and the gravity direction, i.e. equation (3):

V_a = φ_3(V) = (cos⟨v_1, d⟩, sin⟨v_1, d⟩) ⊕ … ⊕ (cos⟨v_11, d⟩, sin⟨v_11, d⟩)    (3)

The invention describes the angle feature of a skeleton segment by the trigonometric function values of the angle between the segment and the gravity direction. In equation (3), d denotes a unit vector pointing in the gravity direction; cos⟨v_i, d⟩ = (v_i · d) / (|v_i| |d|) is the cosine of the angle between each skeleton vector and the gravity direction, and sin⟨v_i, d⟩ is the corresponding sine value. Through these steps the 2 spatial context features contained in the human skeleton are extracted, namely the relative visible lengths V_l of the skeleton segments and the angles V_a between the segments and the gravity direction. The shape feature of the human gesture skeleton is denoted bone = V_l ∪ V_a.
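As a sketch of equations (2) and (3), the skeleton features could be assembled as below, assuming each segment v_i is given as a 2-D vector, that the image y-axis points downward so the gravity direction is d = (0, 1), and that the sine is recovered from the cosine by magnitude; these conventions are assumptions for the example:

```python
import numpy as np

def skeleton_features(segments: np.ndarray, head_vec: np.ndarray) -> np.ndarray:
    """Build the skeleton part (bone = V_l ∪ V_a) of the spatial context feature.

    segments: (11, 2) array, one 2-D vector per limb skeleton segment v_i.
    head_vec: 2-D vector V_head from the vertex to the centre of the neck.
    Returns the concatenation of V_l (relative visible lengths, eq. (2))
    and V_a (cos/sin of the angle to the gravity direction, eq. (3)).
    """
    d = np.array([0.0, 1.0])                              # assumed unit vector along gravity
    head_len = np.linalg.norm(head_vec)
    seg_len = np.linalg.norm(segments, axis=1)

    v_l = seg_len / head_len                              # eq. (2): |v_i| / |V_head|
    cos_a = segments @ d / np.maximum(seg_len, 1e-8)      # eq. (3): cosine to gravity direction
    sin_a = np.sqrt(np.clip(1.0 - cos_a**2, 0.0, 1.0))    # sine magnitude recovered from cosine
    return np.concatenate([v_l, cos_a, sin_a])
```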
The gesture part contours comprise the category features of the human head and hands. To obtain the part contour categories, the invention draws on the SSD idea and replaces the feature extraction network VGGNet in SSD with MobileNet, which has fewer parameters, to construct the gesture part contour feature extraction network GPEN:
GPEN detects and classifies the part contour features on multi-scale convolutional feature maps. For each cell of each scale's feature map, GPEN uses anchor boxes with different scales and aspect ratios to predict the gesture part within the box, generating the position of the prediction box and the category confidences of the different part contours. Let S be the set of part contour feature values (L, C) recognized by GPEN from the image, where L denotes the position information of a part contour prediction box, consisting of the coordinates of its centre point together with its width and height, and C denotes the set of confidences for classifying the object contour contained in the prediction box into the different part contour categories. For example, c_i denotes the confidence that a part contour belongs to the i-th part contour class, i.e. c_i ∈ C.
For each part contour p (s_p ∈ S), the position information is l_p and the class confidence set is C_p. Suppose the category corresponding to the largest confidence value in C_p is m, and M is the complete set of gesture part contour categories; then the category of p is set to m (m ∈ M) with confidence value c_m (c_m ∈ C_p), and the feature value of s_p is (l_m, c_m). By analogy, the set of feature values for all part contours in an image is S = (L_m, C_m). According to a preset confidence threshold c_th (set to 0.5 in the invention; below this value the object is not considered a part contour), the elements whose c_m is lower than c_th are removed from S, and the remaining elements of S are sorted in descending order of confidence value to form the final part contour set G. The following 3 steps are then repeated:
1) Take the part with the highest confidence value c_m in G and compute, with each of the other parts in G, equation (4), where J(l_m, l_other) denotes the degree of overlap between this part's contour and the other part's contour, l_m is the position feature of this part's contour and l_other is the position feature of the other part's contour:

J(l_m, l_other) = |l_m ∩ l_other| / |l_m ∪ l_other|    (4)

2) The overlap threshold for recognizing the same part contour is J_th, set to 0.5 in the invention, i.e. an overlap coverage above 50% is regarded as the same part contour; therefore, when J(l_m, l_other) is higher than J_th, the part feature s_other corresponding to l_other is deleted from G.
3) When the above operations on the sorted part set G are complete, the part feature s_m corresponding to l_m is deleted from G and the (l_m, c_m) value corresponding to s_m is output. The category to which m belongs determines whether the part contour is a left-hand contour feature S_left (or a right-hand contour feature S_right, or a head contour feature S_head).
These steps are repeated until the set G is empty, finally yielding the left-hand part contour feature S_left, the right-hand part contour feature S_right and the head part contour feature S_head. On this basis, equation (5) splices the human skeleton feature bone of the gesture with the left-hand part contour feature S_left, the right-hand part contour feature S_right and the head part contour feature S_head to form the spatial context feature F of the gesture. Namely:

F = φ_4(bone, S_left, S_right, S_head)    (5)
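The de-duplication loop above is essentially non-maximum suppression over the detected part contours. A sketch under the assumption that each detection is a (box, class, confidence) tuple with the box encoded as (cx, cy, w, h); the data layout and function names are illustrative:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]          # (cx, cy, w, h)
Detection = Tuple[Box, int, float]               # (box, class index m, confidence c_m)

def iou(a: Box, b: Box) -> float:
    """Equation (4): overlap J(l_m, l_other) as intersection over union of the two boxes."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def select_part_contours(dets: List[Detection], c_th: float = 0.5, j_th: float = 0.5) -> List[Detection]:
    """Keep detections above the confidence threshold c_th, then repeatedly take the most
    confident one and drop any remaining detection overlapping it by more than j_th."""
    g = sorted([d for d in dets if d[2] >= c_th], key=lambda d: d[2], reverse=True)
    kept: List[Detection] = []
    while g:
        best = g.pop(0)                          # highest-confidence contour in G
        kept.append(best)
        g = [d for d in g if iou(best[0], d[0]) <= j_th]
    return kept
```

The surviving detections can then be split by class into S_left, S_right and S_head and concatenated with bone as in equation (5).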
(3) Introducing a long short-term memory network to extract the temporal features of the skeleton and of the left-hand, right-hand and head contours in dynamic human gestures, fusing the human spatial context features, further classifying and recognizing the gestures, and completing the construction of GRSCTFF;
In dynamic gesture recognition, the gesture category is related not only to the current gesture features but also to the previous gesture features. Let f_cls be the gesture classification function and classification the recognized human gesture category; F_0 denotes the spatial context feature of the body at time 0, F_1 the spatial context feature at time 1, and F_τ the human spatial context feature at time τ. The gesture category at the current time is then obtained from equation (6):

classification = f_cls(F_0, F_1, …, F_τ)    (6)

Equation (6) shows that, to accurately identify the current dynamic gesture category, a structure is needed that preserves the spatial context features of previous gestures. The invention therefore introduces an LSTM network to associate the spatial features in a dynamic gesture with their temporal order and ultimately complete dynamic gesture classification.
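A PyTorch sketch of equation (6) is given below, assuming the per-frame spatial context feature F_τ has already been assembled into a fixed-length vector; the feature size, hidden size and number of gesture classes are placeholders, not values from the patent:

```python
import torch
import torch.nn as nn

class GestureClassifier(nn.Module):
    """LSTM that consumes the sequence F_0, F_1, ..., F_τ of spatial context
    features and predicts the gesture class at the current time step."""
    def __init__(self, feature_dim: int = 36, hidden_dim: int = 128, num_classes: int = 8):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, sequence length τ+1, feature_dim)
        out, _ = self.lstm(features)             # out: (batch, seq, hidden_dim)
        return self.head(out[:, -1])             # class logits for the current time step
```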
The main features of the invention are:
(1) The invention proposes a method that describes human gestures by combining human skeleton posture features with gesture part contour features. Representing gestures with skeleton posture features alone easily leads to misjudging the gesture category when the body is partially occluded; representing them with part contours alone relies excessively on apparent features of the body parts, such as colour, texture and edge information, and is therefore limited. As shown in equation (5) in the disclosure, the technical difficulty of the invention lies in representing the overall human body model by splicing the human skeleton features with the gesture part contour features. The skeleton features are computed by equation (3), and the invention creatively proposes using the lengths normalized by the head-to-neck segment and the angles of the limbs relative to the direction of gravitational acceleration as the skeleton features. The gesture part contour features are obtained via equation (4), and the part categories identified by the contour detection network are used to represent local information of the human body, making the human body model structure more complete.
(2) The invention prunes the CPM network to construct the human skeleton key node recognition network KEN, giving it sufficient real-time performance: in practical tests it reaches a recognition speed of 15 frames per second while maintaining high recognition accuracy. The network model does not depend on specific human spatial constraints and breaks the limitation of traditional human gesture recognition networks that rely on the apparent features of body parts. In addition, the gesture part contour feature extraction network GPEN proposed by the invention adjusts the anchor box sizes according to the part contours in the data set, so that the anchor boxes have sufficient detection capability for smaller contours such as the hands and head; the features extracted by the part contour network therefore supplement the description of the body well when part of the limb skeleton is occluded, while the network model has few parameters and a high detection speed. Finally, the LSTM network is adopted to extract the temporal features of dynamic human gestures, building the temporal associations within a dynamic gesture and avoiding the misjudgement of continuous dynamic gestures that arises from treating individual poses independently.
(3) The invention designs and implements the human dynamic gesture recognizer GRSCTFF, which can accurately recognize the category of dynamic human gestures in a variety of complex scenes, solving the problem that computer-vision-based gesture recognition is easily affected by illumination, background and dynamic gesture changes. The GRSCTFF network model has few parameters, runs fast and achieves high recognition accuracy.
Drawings
FIG. 1 is a gesture skeletal feature extraction process;
FIG. 2(a) is a human gesture;
FIG. 2(b) is a diagram of a human gesture corresponding to a skeletal key node and a skeletal vector;
FIG. 2(c) is a human gesture corresponding to a gesture component outline;
FIG. 3 is a KEN network architecture;
figure 4 is a GPEN network architecture;
FIG. 5 is a convolution process of a depth separable convolution;
FIG. 6 is a scatter plot of feature size ratios;
Fig. 7 is an LSTM network architecture.
Detailed Description
The invention adopts the following technical scheme and implementation steps:
(1) On the basis of analyzing the spatial context features of human gestures, establishing a dynamic gesture model based on human skeleton and part contour features;
FIG. 2(a) shows a generic human gesture model. To recognize the gesture, the human skeleton (FIG. 2(b)) and the contour features of the head and the left and right hands (FIG. 2(c)) must be recognized.
The human skeleton can be abstracted as 14 key nodes and the lines connecting them, as shown in FIG. 2(a). The coordinate set of these key nodes in FIG. 2(b) is Y, where Y_1 denotes human key node No. 1 and the remaining key nodes are numbered analogously, so Y = (Y_1, Y_2, …, Y_14). V denotes the set of connection dependencies that exist between adjacent key nodes in Y, i.e. the human limb skeleton, which, as shown in FIG. 2(b), consists of 3 parts: the head skeleton V_head, the upper-body skeleton V_upper and the lower-body skeleton V_lower. Namely:

V = V_head ∪ V_upper ∪ V_lower    (7)

If v is a key node connection (i.e. v ∈ V) whose starting and ending key nodes are Y_a and Y_b respectively, then the vector v from Y_a to Y_b denotes a skeleton vector contained in the human skeleton. Similarly to the key node classification, the human gesture parts mainly comprise the head and the hands, where the hands comprise a left hand and a right hand, as shown in FIG. 2(c). The human gesture model is completely described by fusing the gesture part contours shown in FIG. 2 with the human skeleton posture.
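A small sketch of this skeleton model is shown below; the (start, end) key-node index pairs are hypothetical and stand in for the actual pairing of FIG. 2(b), which is not recoverable from the text:

```python
import numpy as np

# Hypothetical (start, end) key-node index pairs; the real pairing follows FIG. 2(b).
SKELETON_EDGES = [(0, 1),                                    # head: vertex -> neck (V_head)
                  (1, 2), (2, 3), (3, 4),                    # upper body (V_upper), illustrative
                  (1, 5), (5, 6), (6, 7),
                  (8, 9), (9, 10), (11, 12), (12, 13)]       # lower body (V_lower), illustrative

def skeleton_vectors(key_nodes: np.ndarray) -> np.ndarray:
    """key_nodes: (14, 2) array of key-node coordinates Y_1..Y_14.
    Returns one vector per skeleton segment, i.e. the vector from Y_a to Y_b."""
    return np.stack([key_nodes[b] - key_nodes[a] for a, b in SKELETON_EDGES])
```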
(2) The spatial context feature extraction module is designed and implemented as follows: a deep neural network is constructed using the convolutional pose machine and single-shot multibox detector techniques to extract the human gesture skeleton and part contour features, which are combined into the human spatial context features.
The spatial context feature extraction module mainly comprises two parts: the design and implementation of the human skeleton key node recognition network KEN, and the design and implementation of the gesture part contour feature extraction network GPEN, as follows:
1) design and realization of human skeleton key node identification network KEN:
the conventional CPM outputs 15 hot spot maps while monitoring human activities. Wherein, 14 hot spot graphs correspond to corresponding key nodes of a human body, and the other 1 hot spot graph is a background hot spot graph. The invention uses the CPM idea for reference and supplements the incidence relation among key nodes of the human skeleton at the output end. Meanwhile, in order to support gesture real-time recognition and cut the depth of the CPM, a human body key node extraction network KEN comprising 3 stages is constructed, and FIG. 3 is a network architecture thereof.
In fig. 3, C denotes a convolution layer, K denotes a convolution kernel size, OC denotes the number of output channels, and t denotes a stage. The KEN adopts the first 10 layers of the VGG-19 network as an image feature extraction network to process an input image, and the sizes of convolution kernels from the 1 st layer to the 10 th layer are respectively as follows: 3 × 3 × 64, 3 × 03 × 164, 3 × 23 × 3128, 3 × 43 × 5128, 3 × 63 × 7512, 3 × 83 × 9512, 3 × 3 × 0256, 3 × 3 × 256, and the convolution layers of the 2 nd layer, the 4 th layer, and the 6 th layer are sequentially subjected to maximum pooling, the convolution kernel size of all the maximum pooling is 2 × 2 and the step size is 2, and finally the characteristic x of the image is obtained by the above-mentioned processingz. The KEN tailors the CPM network depth and implements a classifier comprising 3 stages. Wherein Z is the set of all position coordinates of the human skeleton in the image, and when t is 1, Stage1 uses the image characteristic xzAs input, a classifier g of stage1 is implemented1(·),g1(. 2) including 2 branches, the convolutional layer parameter of 2 branches is identical, every branch all contains 5 convolutional layers, in which the convolutional kernel size of every layer is respectively 3X 128, 3X 128, 1X 512, 1X 8 according to the precedence order, the first branch outputs joint point YkConfidence set b of1(YkZ) (Z ∈ Z), the second branch outputs a set of skeletons L1; when t is 2, stage 2 takes the image feature xzAnd joint point Y output in stage1kConfidence set b1(YkZ) (Z ∈ Z) and a set of skeletons L1 as inputs, implementing stage 2 classifier g2(·),g2(. 2) including 2 branches, the convolutional layer parameter of 2 branches is identical, and each branch all contains 7 convolutional layers, and the convolutional kernel size of each layer is respectively 7 × 7 × 128, 7 × 07 × 1128, 7 × 27 × 3128, 7 × 7 × 128, 1 × 1 × 8 according to the precedence order, and the first branch outputs joint point YkConfidence set b of2(YkZ) (Z ∈ Z), the second branch outputs a set of skeletons L2; when t is 3, stage 3 takes the imageCharacteristic xzAnd joint point Y output in stage 2kConfidence set b2(YkZ) (Z ∈ Z) and a skeleton set L2 as inputs, implementing stage 3 classifier g3(·),g3(g) a build-up layer structure and2(. o) complete coincidence, the first branch outputting the joint point YkConfidence set b of3(YkZ) (Z ∈ Z), the second branch outputs a skeleton set L3.
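The cascade structure of KEN could be sketched in PyTorch as follows. This is only an outline of the staging and feature-concatenation scheme: the backbone is left generic, both branches are given identical 7 × 7 layers, and the number of output maps (14 key nodes plus background) is an assumption; the true layer counts and kernel sizes are those listed above.

```python
import torch
import torch.nn as nn

class KENStage(nn.Module):
    """One refinement stage g_t with two branches: belief maps b_t and skeleton maps L_t."""
    def __init__(self, in_ch: int, mid_ch: int = 128, num_maps: int = 15):
        super().__init__()
        def branch() -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(in_ch, mid_ch, 7, padding=3), nn.ReLU(inplace=True),
                nn.Conv2d(mid_ch, mid_ch, 7, padding=3), nn.ReLU(inplace=True),
                nn.Conv2d(mid_ch, mid_ch, 7, padding=3), nn.ReLU(inplace=True),
                nn.Conv2d(mid_ch, 512, 1), nn.ReLU(inplace=True),
                nn.Conv2d(512, num_maps, 1))
        self.belief_branch = branch()      # outputs b_t(Y_k = z)
        self.skeleton_branch = branch()    # outputs the skeleton set L_t

    def forward(self, x):
        return self.belief_branch(x), self.skeleton_branch(x)

class KEN(nn.Module):
    """Three-stage cascade: stage 1 sees only the image feature x_z; stages 2 and 3
    also see the previous stage's belief and skeleton maps."""
    def __init__(self, backbone: nn.Module, feat_ch: int = 256, num_maps: int = 15):
        super().__init__()
        self.backbone = backbone                         # e.g. the pruned VGG-19 front end
        self.stage1 = KENStage(feat_ch, num_maps=num_maps)
        self.stage2 = KENStage(feat_ch + 2 * num_maps, num_maps=num_maps)
        self.stage3 = KENStage(feat_ch + 2 * num_maps, num_maps=num_maps)

    def forward(self, image):
        x = self.backbone(image)
        b1, l1 = self.stage1(x)
        b2, l2 = self.stage2(torch.cat([x, b1, l1], dim=1))
        b3, l3 = self.stage3(torch.cat([x, b2, l2], dim=1))
        return (b1, l1), (b2, l2), (b3, l3)
```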
KEN contains 3 cost functions, which compute the Euclidean distance between the joint point confidence sets b_1(Y_k = z), b_2(Y_k = z), b_3(Y_k = z) output by the 3 stages and the true confidence b_*(Y_k = z); these intermediate losses prevent the vanishing-gradient problem during network training. The total error produced by KEN is computed according to equation (8):

f = Σ_{t=1}^{3} Σ_{j=1}^{14} Σ_{z∈Z} || b_t^j(z) − b_*^j(z) ||_2^2    (8)

where b_t^j(z) is the predicted confidence of the j-th key node of the human skeleton at position z, and b_*^j(z) is the true confidence of the j-th key node of the human skeleton.
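A sketch of this intermediate-supervision loss, assuming each stage returns belief maps of shape (batch, 14, H, W) and the ground-truth confidence maps share that shape:

```python
import torch

def ken_loss(stage_beliefs, target_beliefs):
    """Equation (8): sum over the 3 stages of the squared Euclidean distance between
    the predicted belief maps b_t and the true confidence maps b_*.

    stage_beliefs: list of 3 tensors, each (batch, 14, H, W).
    target_beliefs: tensor (batch, 14, H, W) of ground-truth confidence maps.
    """
    loss = torch.zeros((), device=target_beliefs.device)
    for b_t in stage_beliefs:                                  # t = 1, 2, 3
        loss = loss + torch.sum((b_t - target_beliefs) ** 2)   # sum over j and z
    return loss
```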
The KEN network is trained using the public human key node data set released by AI Challenger as training samples. During training of the human skeleton feature extraction network KEN, the batch size is 15; gradient descent uses the Adam optimizer with a learning rate of 0.0008 and an exponential decay rate of 0.8 every 20000 steps.
2) Design and implementation of the gesture part contour feature extraction network GPEN:
Because the labelled data in the data set are relatively scarce, training the SSD network directly easily causes overfitting. To mitigate overfitting and reduce the number of network model parameters, the invention replaces the feature extraction network VGGNet in SSD with MobileNet, which has fewer parameters, and thereby constructs the gesture part contour feature extraction network GPEN. FIG. 4 shows the network structure of GPEN.
In FIG. 4, the 0th convolutional layer of GPEN (Conv0) uses a conventional convolution kernel of size 3 × 3 × 32. The image feature extraction part of GPEN, i.e. the 1st to 13th convolutional layers (Conv1-Conv13), is built by stacking depth-separable convolutions, each group consisting of one single-depth (depthwise) convolution kernel and one single-point (pointwise) convolution kernel; the single-depth kernels of Conv1-Conv13 have sizes 3 × 3 × 32, 3 × 3 × 64, 3 × 3 × 128, 3 × 3 × 256, 3 × 3 × 512 and 3 × 3 × 1024, and the single-point kernels have sizes 1 × 1 × 32 × 64, 1 × 1 × 64 × 128, 1 × 1 × 128 × 128, 1 × 1 × 128 × 256, 1 × 1 × 256 × 512, 1 × 1 × 512 × 512, 1 × 1 × 512 × 1024 and 1 × 1 × 1024 × 1024. The subsequent 8 convolutional layers Conv14_1 to Conv17_2 have kernel sizes, in order, 1 × 1 × 1024 × 256, 3 × 3 × 256 × 512, 1 × 1 × 512 × 128, 3 × 3 × 128 × 256, 1 × 1 × 256 × 64 and 3 × 3 × 64 × 128.
The depth-separable convolutions in Conv1-Conv13 separate the channel correlation from the spatial correlation and replace the conventional convolution kernel with a depth-separable one, which greatly reduces the number of parameters in the network; FIG. 5 shows the complete convolution process. Here M is the number of image input channels, OC the number of output channels, K × K the size of the convolution kernel, D_F × D_F the size of the input feature map and D_L × D_L the size of the output feature map. The ratio of the parameters of a depth-separable convolution kernel to those of a conventional convolution kernel is computed by equation (9):

(K × K × M × D_L × D_L + M × OC × D_L × D_L) / (K × K × M × OC × D_L × D_L) = 1/OC + 1/K²    (9)

Because OC is large when the convolutional network extracts image features, the term 1/OC can be ignored. If an ordinary convolution kernel of size 3 × 3 is used, the term 1/K² has the value 1/9. It can be seen that the depth-separable convolution greatly reduces the number of feature parameters and thereby helps prevent the network from overfitting.
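A PyTorch sketch of one depth-separable block (single-depth convolution followed by a single-point convolution) is shown below; the channel counts are example values, and the parameter comparison at the end mirrors equation (9):

```python
import torch.nn as nn

def separable_block(in_ch: int, out_ch: int, stride: int = 1) -> nn.Sequential:
    """Depthwise 3x3 convolution (one filter per input channel) followed by a
    pointwise 1x1 convolution that mixes channels, as in MobileNet."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

# Parameter comparison for equation (9) with example values in_ch=128, out_ch=256, K=3.
sep = sum(p.numel() for p in separable_block(128, 256).parameters() if p.dim() == 4)
std = 3 * 3 * 128 * 256                       # weights of a standard 3x3 convolution
print(sep / std)                              # roughly 1/256 + 1/9 ≈ 0.115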
The loss function for GPEN network training consists of a classification loss and a localization loss, as shown in equation (10):

L(x, c, l, g) = (1/N) ( L_conf(x, c) + α × L_Loc(x, l, g) )    (10)

where L_conf is the classification loss function, for which the invention uses the Softmax loss; L_Loc is the localization loss function of the prediction box, for which the invention uses the smooth L1 loss; α is the weight coefficient of the localization loss, set to 1; N is the number of samples input to GPEN; x is the category matching information of the current prediction box; g is the ground-truth detection box; and the remaining variables are consistent with the GPEN definitions in the disclosure: l denotes the position information of a part contour prediction box (l ∈ L), consisting of the coordinates of its centre point together with its width and height, and c denotes the confidence of classifying the object contour contained in the prediction box into a different part contour category (c ∈ C).
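A simplified sketch of this combined loss follows, computing the Softmax classification term over all anchors and the smooth-L1 localization term over the matched (positive) anchors only; hard-negative mining and box encoding are omitted, and the tensor shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def gpen_loss(cls_logits, box_preds, cls_targets, box_targets, alpha: float = 1.0):
    """Equation (10): (L_conf + alpha * L_loc) / N.

    cls_logits:  (num_anchors, num_classes) predicted class scores c.
    box_preds:   (num_anchors, 4) predicted box offsets l.
    cls_targets: (num_anchors,) class index per anchor, 0 = background.
    box_targets: (num_anchors, 4) encoded ground-truth boxes g.
    """
    positive = cls_targets > 0
    n = positive.sum().clamp(min=1).float()                               # matched anchors N
    l_conf = F.cross_entropy(cls_logits, cls_targets, reduction="sum")    # Softmax loss
    l_loc = F.smooth_l1_loss(box_preds[positive], box_targets[positive], reduction="sum")
    return (l_conf + alpha * l_loc) / n
```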
before network training, GPEN needs to optimize an SSD anchoring box according to the contour characteristics of hands and heads in human gestures. The scatter plot of the left-right hand and head contour scale proportion of the public traffic police gesture data set labeled video sample adopted by the invention is shown in fig. 6. In fig. 6, the abscissa represents the ratio of the width of the part outline marking frame to the width of the entire image; the ordinate represents the proportion of the height of the part outline marking frame to the height of the whole image. As can be seen from fig. 6, the ratio of the height of the component labeling frame to the height of the original image is less than 0.25, the ratio of the width of the component labeling frame to the width of the original image is less than 0.20, and the normalized dimension of the component labeling frame is between 0.05 and 0.25. In order to train GPEN, the normalized scale value of the anchor box is between 0.05 and 0.3. Both GPEN and original SSD networks contain 6 feature layers including anchor boxes, and the normalized scale of the anchor boxes on each feature layer is shown in the following table:
Figure BDA0002645011500000101
in the GPEN network training part, the image feature extractor of the gesture component outline feature extraction network is changed from VGGNet to MobileNet during network model design, so that a MobileNet pre-training model provided by Google is directly loaded to GPEN. The GPEN after the pre-training model is changed takes a human body gesture video frame data set as a sample for training, and the batch value is 24. In the training process, the loss function value is continuously reduced through random gradient descent and a back propagation mechanism, so that the position of the anchor frame approaches to the position of a real frame, and the classification confidence coefficient is improved. After 120000 steps of accumulative training, network convergence and system accuracy do not change any more, the model is stored for recognizing and extracting human body gesture part contour characteristics, and the relative length and angle characteristic data of the skeleton in the human body gesture calculated by combining the KEN network are combined to obtain the spatial context characteristics.
(3) Temporal feature extraction network and dynamic gesture classification.
According to the key nodes output by KEN and the association relations between them, the relative length of each skeleton segment in the human skeleton and its angle with respect to the gravitational acceleration can be respectively calculated; combined with the left-hand, right-hand and head contour categories output by GPEN, this generates the human gesture spatial context feature F_τ at time τ.
After the human gesture spatial context feature F_τ is obtained, the LSTM network is used to extract the temporal features of the dynamic gesture. FIG. 7 shows the architecture of the LSTM network used in the invention. In FIG. 7, e_{τ−1}, h_{τ−1} and F_τ are the inputs to the LSTM network, where F_τ is the concatenated (concat) feature value of the relative lengths of the skeleton segments at time τ, their angles with respect to the gravitational acceleration, and the left-hand, right-hand and head contour categories. In addition, at the initial time τ the system randomly generates initial values e_0 and h_0, where h_0 serves as the temporal feature of the dynamic gesture and e_0 as the LSTM network memory; e_τ and h_τ are the outputs of the network and serve as its inputs at the next time step when τ > 1. "sigmoid", "tanh" and "softmax" denote activation functions, and P_τ is the probability of the dynamic gesture category obtained by applying the activation function "softmax" to the network output h_τ at time τ. During network training, the cross-entropy function is used to calculate the network loss, and a truncated backpropagation algorithm is adopted to avoid the vanishing-gradient problem.
The invention uses Xavier initialization for the neurons in the LSTM network and trains the LSTM network with a truncated backpropagation algorithm. During training, the human gesture video data set is randomly cut into short clips 90 seconds long, and 128 clips are assembled into one batch. The learning rate of the LSTM is 0.0004 and the gradient descent algorithm is the Adam optimizer. LSTM network training is stopped after 50000 accumulated training steps.
After the GRSCTFF dynamic gesture recognizer is built, it is trained on public data sets. The human skeleton key node network KEN uses the public human key node data set released by AI Challenger as training samples, and the gesture part contour feature extraction network GPEN is trained on the training set of a public traffic-police gesture data set.
Finally, GRSCTFF is experimentally verified on the test set of the public traffic-police gesture data set. The accuracy of human gesture recognition is computed from the edit distance, i.e. the minimum number of edit operations required to convert the gesture sequence predicted by the model into the ground-truth labelled gesture sequence, and the accuracy is computed according to equation (11), where Accuracy denotes the accuracy, H is the total number of poses in the video, I is the total number of inserted poses, D is the total number of deleted poses and P is the total number of replaced poses:

Accuracy = (H − I − D − P) / H × 100%    (11)
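A one-line sketch of the reconstructed form of equation (11), returning a fraction rather than a percentage; H, I, D and P are the counts defined above:

```python
def gesture_accuracy(h: int, i: int, d: int, p: int) -> float:
    """Equation (11): accuracy from the edit operations needed to turn the
    predicted gesture sequence into the labelled one."""
    return (h - i - d - p) / h
```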
Experiments show that GRSCTFF can recognize human gestures quickly and accurately, with a system accuracy of 94.12%, and that it is highly robust to changes in lighting, background and human gesture position.

Claims (2)

1. A gesture recognition method fusing the body skeleton with the contours of the head and hand parts, characterized in that:
(1) on the basis of analyzing the spatial context features of human gestures, a dynamic gesture model based on human skeleton and part contour features is established;
when gesture interaction is used, a universal gesture model is established that fuses the contour features of the human skeleton, hand and head parts;
(2) a deep neural network is constructed using the convolutional pose machine and single-shot multibox detector techniques to extract the human gesture skeleton and part contour features, which are combined into the human spatial context features;
the gesture spatial context information consists of the gesture skeleton configuration and the gesture part contours; the gesture skeleton configuration comprises the relative length features of the human skeleton and its angle features relative to the direction of gravitational acceleration, and a human skeleton key node extraction network KEN comprising 3 stages is constructed:
setting Z as the set of all position coordinates (i, j) of the human skeleton in the image; denoting the position of each skeleton key node in the image by Y_k, the human skeleton containing 14 key nodes in total, hence Y_k ∈ {Y_1, …, Y_14}; KEN consists of a series of multi-class predictors g_t trained to predict the location of each key node in the same image under different receptive fields; specifically, g_t(·) is a classifier and the subscript t ∈ {1, …, T} indexes the classification stages, each stage having a different receptive field, where T is the last stage of the classifier; g_t(·) predicts the confidence that a point z in the image under that receptive field belongs to key node Y_k, where z ∈ Z; the confidence value is written b(Y_k = z), so b_T(Y_k = z) is the confidence that coordinate point z is the key node at stage T; all g_t(·) share the same objective function value, i.e. the true confidence; when t > 1, g_t(·) is a function that concatenates the feature value x_z extracted at image position z with the confidence predictions for each key node Y_k at time t−1; after T stages, the position with the highest confidence is the key node position, and argmax denotes the function that returns the coordinate point z at which the confidence of key node Y_k is maximal; namely:

Y_k = argmax_z b_T(Y_k = z), k ∈ {1, …, 14}    (1)

calculating the position of each key node in the human skeleton based on equation (1), and establishing a preliminary human skeleton configuration;
taking the height of the human head as a reference point and introducing a function φ_2(·) denoting the vector concatenation of the relative visible lengths of the skeleton segments contained in the human skeleton:

V_l = φ_2(V) = (|v_1| / |V_head|) ⊕ (|v_2| / |V_head|) ⊕ … ⊕ (|v_11| / |V_head|)    (2)

wherein 11 indicates that the human skeleton model contains 11 limb skeleton segments in total, v_i is the i-th skeleton segment of the human gesture, V_head is the head skeleton vector from the vertex to the centre of the neck, | · | denotes the vector modulus, i.e. the length of the head skeleton, and ⊕ denotes vector concatenation; the formula takes V_head as the reference and divides the length of each limb skeleton segment by the modulus of V_head to calculate the visible length of each segment relative to the head skeleton;
in addition, since the direction of gravitational acceleration is always perpendicular to the ground, in order to describe the direction of each skeleton segment in the human skeleton relative to the ground, the angle between each segment and the gravitational acceleration is introduced, and φ_3(·) denotes the vector concatenation of the angles between each skeleton segment and the gravity direction, as in equation (3):

V_a = φ_3(V) = (cos⟨v_1, d⟩, sin⟨v_1, d⟩) ⊕ … ⊕ (cos⟨v_11, d⟩, sin⟨v_11, d⟩)    (3)

describing the angle feature of each skeleton segment by the trigonometric function values of the angle between the segment and the gravitational acceleration direction; in equation (3), d denotes a unit vector pointing in the gravity direction; cos⟨v_i, d⟩ = (v_i · d) / (|v_i| |d|) is the cosine of the angle between each skeleton vector and the gravity direction, and sin⟨v_i, d⟩ is the corresponding sine value; through these steps the 2 spatial context features contained in the human skeleton are extracted, namely the relative visible lengths V_l of the skeleton segments and the angles V_a between the segments and the gravity direction; the shape feature of the human gesture skeleton is denoted bone = V_l ∪ V_a;
constructing the gesture part contour feature extraction network GPEN:
setting S as the set of part contour feature values (L, C) recognized from the image by GPEN, wherein L denotes the position information of a part contour prediction box, consisting of the coordinates of its centre point together with its width and height; C denotes the set of confidences for classifying the object contour contained in the prediction box into the different part contour categories; c_i denotes the confidence that a part contour belongs to the i-th part contour class, i.e. c_i ∈ C;
for each part contour p there is s_p ∈ S with position information l_p and class confidence set C_p; supposing the category corresponding to the largest confidence value in C_p is m and M is the complete set of gesture part contour categories, the category of p is set to m, where m ∈ M, with confidence value c_m, where c_m ∈ C_p, and the feature value of s_p is then (l_m, c_m); by analogy, the feature value set of all part contours in an image is S = (L_m, C_m); according to a preset confidence threshold c_th, set to 0.5, below which the object is not considered a part contour, the elements whose c_m is lower than c_th are removed from S, and the elements of S are sorted in descending order of confidence value to form the final part contour set G; the following 3 steps are repeated:
1) taking the part with the highest confidence value c_m in G and calculating, with each of the other parts in G, equation (4), where J(l_m, l_other) denotes the degree of overlap between this part's contour and the other part's contour, l_m is the position feature of this part's contour and l_other is the position feature of the other part's contour;

J(l_m, l_other) = |l_m ∩ l_other| / |l_m ∪ l_other|    (4)

2) the overlap threshold for recognizing the same part contour is J_th, set to 0.5, i.e. an overlap coverage above 50% is regarded as the same part contour, so when J(l_m, l_other) is higher than J_th, the part feature s_other corresponding to l_other is deleted from G;
3) when the above operations on the sorted part set G are complete, the part feature s_m corresponding to l_m is deleted from G and the (l_m, c_m) value corresponding to s_m is output; the category to which m belongs determines whether the part contour is a left-hand contour feature S_left (or a right-hand contour feature S_right, or a head contour feature S_head);
repeating steps 1) to 3) until the set G is empty, finally obtaining the left-hand part contour feature S_left, the right-hand part contour feature S_right and the head part contour feature S_head; on this basis, equation (5) splices the human skeleton feature bone of the gesture with the left-hand part contour feature S_left, the right-hand part contour feature S_right and the head part contour feature S_head to form the spatial context feature F of the gesture; namely:

F = φ_4(bone, S_left, S_right, S_head)    (5)
(3) a long short-term memory network is introduced to extract the temporal features of the skeleton and of the left-hand, right-hand and head contours in dynamic human gestures, the human spatial context features are fused, and the gestures are further classified and recognized, completing the construction of GRSCTFF;
in dynamic gesture recognition, the gesture category is related not only to the current gesture features but also to the previous gesture features; f_cls is the gesture classification function and classification denotes the recognized human gesture category; F_0 denotes the spatial context feature of the body at time 0, F_1 the spatial context feature at time 1, and F_τ the human spatial context feature at time τ, so that the gesture category at the current time is obtained according to equation (6);

classification = f_cls(F_0, F_1, …, F_τ)    (6)
2. The method of claim 1, wherein:
the human skeleton is abstracted as 14 key nodes and the lines connecting them; the coordinate set of these key nodes is Y, where Y_1 denotes human key node No. 1 and the remaining key nodes are numbered analogously, so Y = (Y_1, Y_2, …, Y_14); V denotes the set of connection dependencies existing between adjacent key nodes in Y, i.e. the human limb skeleton, which consists of 3 parts: the head skeleton V_head, the upper-body skeleton V_upper and the lower-body skeleton V_lower; namely:

V = V_head ∪ V_upper ∪ V_lower    (7)

if v is a key node connection (i.e. v ∈ V) whose starting and ending key nodes are Y_a and Y_b respectively, then the vector v from Y_a to Y_b denotes a skeleton vector contained in the human skeleton; similarly to the key node classification, the human gesture parts mainly comprise the head and the hands, where the hands comprise a left hand and a right hand, and the human gesture model is completely described by fusing the gesture part contours with the human skeleton posture;
(2) the design of the spatial context feature extraction module is realized by constructing a deep neural network using the convolutional pose machine and single-shot multibox detector techniques to extract the human gesture skeleton and part contour features, and combining them into the human spatial context features;
the spatial context feature extraction module comprises two parts: the design and implementation of the human skeleton key node recognition network KEN, and the design and implementation of the gesture part contour feature extraction network GPEN, as follows:
1) design and implementation of the human skeleton key node recognition network KEN:
15 heat maps are output when detecting human poses, wherein 14 heat maps correspond to the corresponding human key nodes and the remaining 1 is a background heat map; the association relations between the skeleton key nodes are supplemented at the output end; meanwhile, to support real-time gesture recognition, the CPM depth is pruned and a human key node extraction network KEN comprising 3 stages is constructed;
C denotes a convolutional layer, K the convolution kernel size, OC the number of output channels and t the stage; KEN adopts the first 10 layers of the VGG-19 network as the image feature extraction network for the input image, with convolution kernel sizes 3 × 3 × 64, 3 × 3 × 64, 3 × 3 × 128, 3 × 3 × 128, 3 × 3 × 512, 3 × 3 × 512, 3 × 3 × 256 and 3 × 3 × 256; in addition, the convolutional layers at the 2nd, 4th and 6th positions are each followed by max pooling with a 2 × 2 kernel and stride 2, and this processing finally yields the image feature x_z; KEN prunes the CPM network depth and implements a classifier comprising 3 stages, where Z is the set of all position coordinates of the human skeleton in the image; when t = 1, stage 1 takes the image feature x_z as input and implements the stage-1 classifier g_1(·); g_1(·) comprises 2 branches with identical convolutional layer parameters, each containing 5 convolutional layers whose kernel sizes are, in order, 3 × 3 × 128, 3 × 3 × 128, 1 × 1 × 512 and 1 × 1 × 8; the first branch outputs the joint point confidence set b_1(Y_k = z), z ∈ Z, and the second branch outputs the skeleton set L1; when t = 2, stage 2 takes the image feature x_z, the joint point confidence set b_1(Y_k = z) output by stage 1 and the skeleton set L1 as inputs and implements the stage-2 classifier g_2(·); g_2(·) comprises 2 branches with identical convolutional layer parameters, each containing 7 convolutional layers whose kernel sizes are, in order, 7 × 7 × 128, 7 × 7 × 128, 7 × 7 × 128, 7 × 7 × 128 and 1 × 1 × 8; the first branch outputs the joint point confidence set b_2(Y_k = z) and the second branch outputs the skeleton set L2; when t = 3, stage 3 takes the image feature x_z, the joint point confidence set b_2(Y_k = z) output by stage 2 and the skeleton set L2 as inputs and implements the stage-3 classifier g_3(·), whose convolutional layer structure is identical to that of g_2(·); the first branch outputs the joint point confidence set b_3(Y_k = z) and the second branch outputs the skeleton set L3;
KEN contains 3 cost functions, which respectively calculate the Euclidean distance between the joint point confidence sets b_1(Y_k = z), b_2(Y_k = z), b_3(Y_k = z) output by the 3 stages and the true confidence b_*(Y_k = z); the total error of the system produced by KEN is calculated according to equation (8);

f = Σ_{t=1}^{3} Σ_{j=1}^{14} Σ_{z∈Z} || b_t^j(z) − b_*^j(z) ||_2^2    (8)

where b_t^j(z) is the predicted confidence of the j-th key node of the human skeleton at position z, and b_*^j(z) is the true confidence of the j-th key node of the human skeleton;
in the training of the human skeleton feature extraction network KEN, the batch size is 15; gradient descent uses the Adam optimizer with a learning rate of 0.0008 and an exponential decay rate of 0.8 every 20000 steps;
2) design and realization of a gesture part outline feature extraction network GPEN:
constructing a gesture component outline feature extraction network GPEN;
layer 0 convolutional layer Conv0 of GPEN uses convolutional kernels, the size of which is 3 × 3 × 32; the image feature extraction network section in GPEN, i.e., the 1 st to 13 th convolutional layers, i.e., Conv1-Conv13, is constructed based on a stacking technique of depth-separable convolutions, each set of depth-separable convolutions including one single-depth convolution kernel and one single-point convolution kernel, the sizes of the single-depth convolution kernels of Conv1-Conv13 are respectively 3 × 3 × 32, 3 × 3 × 64, 3 × 3 × 128, 3 × 3 × 256, 3 × 3 × 512, 3 × 3 × 1024, and the sizes of the single-point convolution kernels of Conv1-Conv13 are respectively 1 × 1 × 32 × 64, 1 × 64 × 128, 1 × 1 × 128 × 128, 1 × 1 × 128 × 256, 1 × 1 × 512, 512 × 1 × 512, and 3 × 512, 3 × 512, 1 × 1 × 512 × 512, 1 × 1 × 512 × 1024, 1 × 1 × 1024 × 1024; then the sizes of convolution kernels of 8 convolutional layers in total, namely Conv14_1 to Conv17_2, are respectively 1 × 1 × 1024 × 256, 3 × 3 × 256 × 512, 1 × 1 × 512 × 128, 3 × 3 × 128 × 256, 1 × 1 × 256 × 64 and 3 × 3 × 64 × 128 in sequence;
m is the number of image input channels, OC is the number of output channels, K represents the size of the convolution kernel, K multiplied by K represents the size of the convolution kernel, DF×DFSize of input feature graph, DL×DLRepresenting the size of the output feature map; the characteristic parameter ratio of the depth separable convolution kernel to the traditional convolution kernel is calculated by a formula (9);
\frac{K \cdot K \cdot M \cdot D_L \cdot D_L + M \cdot OC \cdot D_L \cdot D_L}{K \cdot K \cdot M \cdot OC \cdot D_L \cdot D_L} = \frac{1}{OC} + \frac{1}{K^2}    (9)
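As a quick arithmetic check of formula (9): the M and D_L terms cancel, leaving 1/OC + 1/K². For example, with K = 3 and OC = 256:

```python
# Ratio of depthwise-separable to standard convolution cost, per Eq. (9):
# (K*K*M*D*D + M*OC*D*D) / (K*K*M*OC*D*D) = 1/OC + 1/K**2  (M and D cancel).
K, OC = 3, 256
ratio = 1 / OC + 1 / K**2
print(ratio)  # ~0.115, i.e. roughly an 8.7x reduction
```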
The loss function for GPEN network training is composed of a classification loss and a localization loss, as shown in formula (10);
L(x, c, l, g) = \frac{1}{N} \left( L_{conf}(x, c) + \alpha L_{loc}(x, l, g) \right)    (10)
wherein L_{conf} is the classification loss function, for which the Softmax loss is used; L_{loc} is the localization loss function of the prediction box, for which the smooth L1 loss is used; α is the weight coefficient of the localization loss and is set to 1; N is the number of samples input to GPEN; x is the category matching information of the current prediction box; g is the ground-truth detection box; l ∈ L represents the position information of a part-contour prediction box, consisting of the coordinates of the prediction box's center point together with its width and height; and c ∈ C represents the confidence with which the object contour contained in the prediction box is classified into the different part-contour categories;
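A minimal sketch of formula (10) with a Softmax (cross-entropy) classification term and a smooth-L1 localization term; anchor matching and box encoding are omitted, and the tensor layout is an assumption.

```python
import torch
import torch.nn.functional as F

def gpen_loss(cls_logits, cls_targets, loc_preds, loc_targets, alpha=1.0):
    """cls_logits: [N, num_classes]; cls_targets: [N] class indices;
    loc_preds / loc_targets: [N, 4] encoded box offsets (cx, cy, w, h)."""
    n = cls_logits.shape[0]
    l_conf = F.cross_entropy(cls_logits, cls_targets, reduction='sum')  # Softmax loss
    l_loc = F.smooth_l1_loss(loc_preds, loc_targets, reduction='sum')   # smooth L1 loss
    return (l_conf + alpha * l_loc) / n
```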
GPEN contains 6 feature layers that carry anchor boxes; the normalized scales of the anchor boxes on each feature layer are as follows:
[Normalized anchor-box scales for the 6 feature layers: given as an image in the original claim and not reproduced in the text extraction.]
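The actual scale values appear only as an image in the source. Purely as an illustration of how such normalized scales are commonly chosen (the SSD convention, not necessarily the patent's values), they can be spaced linearly between a minimum and maximum scale across the 6 feature layers:

```python
# Assumed SSD-style anchor scales (NOT the patent's actual values):
# s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1), for k = 1..m feature layers.
s_min, s_max, m = 0.2, 0.9, 6
scales = [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]
print([round(s, 2) for s in scales])  # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```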
Because the image feature extractor of the gesture part contour feature extraction network was changed from VGGNet to MobileNet when the network model was designed, the GPEN training stage directly loads the MobileNet pre-training model provided by Google into GPEN; GPEN, with the pre-trained model loaded, is then trained on the human body gesture video-frame dataset with a batch size of 24; during training, the loss function value is continuously reduced through stochastic gradient descent and the back-propagation mechanism, so that the positions of the anchor boxes approach the positions of the real boxes and the classification confidence improves; training continues until the network converges and the system accuracy no longer changes, after which the model is saved for recognizing and extracting the contour features of the human gesture parts, which are combined with the relative length and angle features of the skeleton segments in the human gesture calculated by the KEN network to obtain the spatial context features;
(3) Design of the time-sequence feature extraction network and dynamic gesture classification:
According to the key nodes output by the KEN and the association relations among the nodes, the relative length of each skeleton segment in the human skeleton and its included angle with the gravitational acceleration are calculated, and these are combined with the left-hand contour, right-hand contour and head contour categories output by the GPEN to generate the human gesture spatial context feature F_τ at time τ.
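A sketch of assembling F_τ from the KEN joints and the GPEN contour categories. The bone list, the reference bone used for length normalization, the gravity direction in image coordinates, and the one-hot encoding of contour categories are all assumptions.

```python
import numpy as np

def spatial_context_feature(joints, bones, ref_bone, contours, n_contour_classes=10):
    """joints: dict name -> (x, y) image coordinates from KEN.
    bones: list of (joint_a, joint_b) pairs defining skeleton segments.
    ref_bone: pair used to normalise lengths (e.g. the torso) -- an assumption.
    contours: (left_hand_cls, right_hand_cls, head_cls) integer labels from GPEN."""
    g = np.array([0.0, 1.0])                       # gravity direction in image coords (assumption)
    ref = np.linalg.norm(np.subtract(joints[ref_bone[1]], joints[ref_bone[0]])) + 1e-6
    feats = []
    for a, b in bones:
        v = np.asarray(np.subtract(joints[b], joints[a]), dtype=np.float64)
        length = np.linalg.norm(v)
        feats.append(length / ref)                 # relative bone length
        cos_angle = float(np.dot(v, g) / (length + 1e-6))
        feats.append(np.arccos(np.clip(cos_angle, -1.0, 1.0)))  # angle to gravity
    for c in contours:                             # one-hot encode the 3 contour categories
        one_hot = np.zeros(n_contour_classes)
        one_hot[c] = 1.0
        feats.extend(one_hot)
    return np.asarray(feats, dtype=np.float32)     # F_tau
```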
After the human gesture spatial context feature F_τ is obtained, an LSTM network is used to extract the time-sequence features of the dynamic gesture; e_{τ-1}, h_{τ-1} and F_τ are the inputs of the LSTM network, where F_τ is the feature value synthesized at time τ from the relative length of each skeleton segment in the human skeleton, the included angle between each segment and the gravitational acceleration, and the left-hand, right-hand and head contour categories. When τ is the initial time, the system randomly generates initial values e_0 and h_0, with h_0 serving as the time-sequence feature of the dynamic gesture and e_0 serving as the memory of the LSTM network; e_τ and h_τ are the outputs of the network and, for τ > 1, serve as the inputs of the LSTM network at the next time step. "sigmoid", "tanh" and "softmax" denote activation functions, and P_τ represents the probability of each dynamic gesture category obtained by passing the network output h_τ at time τ through the activation function "softmax". When the network is trained, the cross-entropy function is used to calculate the network loss, and a truncated back-propagation algorithm is adopted to avoid the vanishing-gradient problem during training;
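The recurrence e_{τ-1}, h_{τ-1}, F_τ → e_τ, h_τ, P_τ can be sketched with a standard LSTM cell, treating e as the cell memory and h as the hidden state; the layer sizes and the class count are assumptions.

```python
import torch
import torch.nn as nn

class GestureLSTM(nn.Module):
    """One step of the time-sequence network: F_tau plus the previous memory e
    and hidden state h produce (e_tau, h_tau) and class probabilities P_tau."""
    def __init__(self, feat_dim=64, hidden_dim=128, n_classes=8):
        super().__init__()
        self.cell = nn.LSTMCell(feat_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, n_classes)

    def forward(self, f_tau, h_prev, e_prev):
        h_tau, e_tau = self.cell(f_tau, (h_prev, e_prev))   # e plays the role of the cell memory
        p_tau = torch.softmax(self.classifier(h_tau), dim=-1)
        return h_tau, e_tau, p_tau

# At the initial time the states are randomly generated, as described in the text:
model = GestureLSTM()
h0, e0 = torch.randn(1, 128), torch.randn(1, 128)
f1 = torch.randn(1, 64)
h1, e1, p1 = model(f1, h0, e0)
```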
The neurons in the LSTM network are initialized with Xavier initialization, and the LSTM network is trained with the truncated back-propagation algorithm. During training, the human gesture video dataset is randomly cut into short clips 90 seconds long, and 128 such clips are assembled into one batch; the learning rate of the LSTM is 0.0004, the Adam optimizer is adopted for the gradient descent algorithm, and LSTM network training stops after 50,000 accumulated training steps;
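A rough training-loop sketch matching the stated settings (Xavier initialization, Adam at 0.0004, truncated back-propagation). The truncation chunk length and the tensor shapes are assumptions, and `GestureLSTM` refers to the sketch above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = GestureLSTM()                       # from the sketch above
for p in model.parameters():                # Xavier initialisation of the weight matrices
    if p.dim() >= 2:
        nn.init.xavier_uniform_(p)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0004)

def train_clip(frames, labels, chunk=30):
    """frames: [T, B, 64] features for one batch of clips; labels: [T, B].
    `chunk` is the truncation length for truncated BPTT (an assumption)."""
    h = torch.zeros(frames.size(1), 128)
    e = torch.zeros(frames.size(1), 128)
    for start in range(0, frames.size(0), chunk):
        optimizer.zero_grad()
        h, e = h.detach(), e.detach()       # cut the gradient between chunks
        loss = torch.zeros(())
        for f_tau, y_tau in zip(frames[start:start + chunk], labels[start:start + chunk]):
            h, e, p_tau = model(f_tau, h, e)
            loss = loss + F.nll_loss(torch.log(p_tau + 1e-9), y_tau)  # cross-entropy loss
        loss.backward()
        optimizer.step()
```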
After the GRSCTFF dynamic gesture recognizer is built, it is trained on a public dataset.
CN202010851927.1A 2020-08-21 2020-08-21 Gesture recognition method for fusing body skeleton and head and hand part profiles Pending CN112183198A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010851927.1A CN112183198A (en) 2020-08-21 2020-08-21 Gesture recognition method for fusing body skeleton and head and hand part profiles

Publications (1)

Publication Number Publication Date
CN112183198A true CN112183198A (en) 2021-01-05

Family

ID=73925010

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010851927.1A Pending CN112183198A (en) 2020-08-21 2020-08-21 Gesture recognition method for fusing body skeleton and head and hand part profiles

Country Status (1)

Country Link
CN (1) CN112183198A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287844A (en) * 2019-06-19 2019-09-27 北京工业大学 Traffic police's gesture identification method based on convolution posture machine and long memory network in short-term
CN110837778A (en) * 2019-10-12 2020-02-25 南京信息工程大学 Traffic police command gesture recognition method based on skeleton joint point sequence

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HE Jian et al., "Visual gesture recognition based on long short-term memory and deep neural network", Journal of Graphics (《图学学报》), 30 June 2020 (2020-06-30), pages 372-381 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022227768A1 (en) * 2021-04-28 2022-11-03 北京百度网讯科技有限公司 Dynamic gesture recognition method and apparatus, and device and storage medium
CN113378641A (en) * 2021-05-12 2021-09-10 北京工业大学 Gesture recognition method based on deep neural network and attention mechanism
CN113378641B (en) * 2021-05-12 2024-04-09 北京工业大学 Gesture recognition method based on deep neural network and attention mechanism
CN113269075A (en) * 2021-05-19 2021-08-17 广州繁星互娱信息科技有限公司 Gesture track recognition method and device, storage medium and electronic equipment
CN113269089A (en) * 2021-05-25 2021-08-17 上海人工智能研究院有限公司 Real-time gesture recognition method and system based on deep learning
CN113269089B (en) * 2021-05-25 2023-07-18 上海人工智能研究院有限公司 Real-time gesture recognition method and system based on deep learning
CN113660527A (en) * 2021-07-19 2021-11-16 广州紫为云科技有限公司 Real-time interactive somatosensory method, system and medium based on edge calculation
CN113840177A (en) * 2021-09-22 2021-12-24 广州博冠信息科技有限公司 Live broadcast interaction method and device, storage medium and electronic equipment
CN113840177B (en) * 2021-09-22 2024-04-30 广州博冠信息科技有限公司 Live interaction method and device, storage medium and electronic equipment
CN116152519A (en) * 2023-04-17 2023-05-23 深圳金三立视频科技股份有限公司 Feature extraction method and device based on image
CN116152519B (en) * 2023-04-17 2023-08-15 深圳金三立视频科技股份有限公司 Feature extraction method and device based on image

Similar Documents

Publication Publication Date Title
Bhattacharya et al. Step: Spatial temporal graph convolutional networks for emotion perception from gaits
CN112183198A (en) Gesture recognition method for fusing body skeleton and head and hand part profiles
CN110427867B (en) Facial expression recognition method and system based on residual attention mechanism
CN109919031B (en) Human behavior recognition method based on deep neural network
Wang et al. Depth pooling based large-scale 3-d action recognition with convolutional neural networks
Özyer et al. Human action recognition approaches with video datasets—A survey
CN108256421A (en) A kind of dynamic gesture sequence real-time identification method, system and device
CN110287844B (en) Traffic police gesture recognition method based on convolution gesture machine and long-and-short-term memory network
Yang et al. Extraction of 2d motion trajectories and its application to hand gesture recognition
CN112800903B (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
Asif et al. A multi-modal, discriminative and spatially invariant CNN for RGB-D object labeling
CN106778796B (en) Human body action recognition method and system based on hybrid cooperative training
Arif et al. Automated body parts estimation and detection using salient maps and Gaussian matrix model
Shbib et al. Facial expression analysis using active shape model
CN106909938B (en) Visual angle independence behavior identification method based on deep learning network
Shen et al. Emotion recognition based on multi-view body gestures
Rao et al. Sign Language Recognition System Simulated for Video Captured with Smart Phone Front Camera.
CN109086659B (en) Human behavior recognition method and device based on multi-channel feature fusion
CN111680550B (en) Emotion information identification method and device, storage medium and computer equipment
Lee et al. 3-D human behavior understanding using generalized TS-LSTM networks
Rao et al. Selfie sign language recognition with multiple features on adaboost multilabel multiclass classifier
Kumar et al. 3D sign language recognition using spatio temporal graph kernels
CN112906520A (en) Gesture coding-based action recognition method and device
Yang et al. TS-YOLO: An efficient YOLO network for multi-scale object detection
Lima et al. Simple and efficient pose-based gait recognition method for challenging environments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination