CN111209861A - Dynamic gesture action recognition method based on deep learning - Google Patents


Info

Publication number
CN111209861A
Authority
CN
China
Prior art keywords
gesture
joint
probability
joint point
model
Prior art date
Legal status
Granted
Application number
CN202010011805.1A
Other languages
Chinese (zh)
Other versions
CN111209861B (en)
Inventor
张烨
陈威慧
樊一超
Current Assignee
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202010011805.1A priority Critical patent/CN111209861B/en
Publication of CN111209861A publication Critical patent/CN111209861A/en
Application granted granted Critical
Publication of CN111209861B publication Critical patent/CN111209861B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06V 40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/107 Static hand or arm

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

A dynamic gesture action recognition method based on deep learning comprises the following steps: step one, constructing a gesture joint point coordinate recognition network, in which an improved CPM (Convolutional Pose Machine) model is used to process a gesture video and output gesture joint point coordinates under a single viewpoint; step two, collecting single-viewpoint video data, i.e. gesture video samples are collected in a single-viewpoint mode, with an ordinary web camera capturing the user's gesture data from multiple angles, comprising: (2.1) defining basic gesture elements; (2.2) selecting gesture joint points; (2.3) preparing a training sample data set; step three, outputting gesture Gaussian heat maps and gesture joint point coordinates; step four, constructing a gesture sequence recognition network, the specific process of building the network model being: (4.1) defining an activation function; (4.2) selecting a loss function; (4.3) establishing the model. Finally, the joint point coordinates obtained in step three are input into the standard gesture sequence recognition network to obtain a gesture action sequence.

Description

Dynamic gesture action recognition method based on deep learning
Technical Field
The invention relates to a dynamic gesture action recognition method based on deep learning.
Background Art
Under the development wave of computer vision, recognizing human gesture actions with convolutional neural networks has become a new research direction. In gesture action recognition, methods based on convolutional neural networks have lower cost and time consumption and higher recognition efficiency than traditional methods; they save the steps of gesture segmentation, manual feature extraction and template matching, and reduce the complexity of the model. However, existing gesture recognition methods only recognize which class a static or dynamic gesture belongs to, and only recognize single gestures; they do not recognize continuous, temporally overlapping dynamic gestures, because there is no recognition framework for combined continuous actions, and such gesture recognition therefore cannot be applied in practical production.
Disclosure of Invention
The present invention provides a gesture recognition method based on computer vision to overcome the above disadvantages of the prior art.
The method first improves the CPM model to construct a gesture joint point coordinate recognition network model, then collects a gesture video under a single viewpoint, and then feeds the collected video into the standard gesture joint point coordinate recognition network to obtain gesture Gaussian heat maps and joint point coordinates. The joint point coordinates are then input into the standard gesture sequence recognition network to obtain a gesture action sequence, and recognition of continuous actions is finally realized.
In order to achieve the purpose, the invention adopts the following technical scheme:
a dynamic gesture action recognition method based on deep learning comprises the following steps:
step one, constructing a gesture joint point coordinate identification network;
the invention utilizes an improved CPM model to process a gesture video and output the coordinates of gesture joint points under a single viewpoint, and the realization process comprises the following steps:
(1) selecting a basic network model for gesture joint point estimation;
the invention selects VGG-13 as a basic network model for gesture joint point estimation.
(2) Setting a receptive field;
the size of the receptive field is related to the sliding window of convolution or pooling, and both are considered as a map, compressing the k × k range of pixel values on the n-layer feature map into one pixel on the n + 1-layer feature map, denoted as fk sWhere s represents the step size of the sliding window, k represents the size of the convolution kernel or pooling kernel, and the mapping relationship is:
Figure BDA0002356368640000011
wherein: x is the number ofn,xn+1Characteristic diagrams of the nth layer and the (n + 1) th layer are shown. The basic network structure of the invention is based on VGG-13, and for the first part of VGG-13, two convolutions and one pooling are included, and the three structures form a cascade, so that the mapping process is repeated in the network for many times to form a multi-level mapping. The parameters of the receptive field and convolution kernel or pooling kernel for each link are shown in table 1:
TABLE 1 Receptive field and convolution/pooling kernel parameters of each layer's feature map under the cascade

Layer            Kernel size K_n   Stride S_n   Receptive field RF_n
Original image   -                 -            1 × 1
Convolution 1    3 × 3             1            3 × 3
Convolution 2    3 × 3             1            5 × 5
Pooling          2 × 2             2            6 × 6
Denote by RF_n the receptive field of the n-th feature map, by K_n the size of the convolution or pooling kernel of the n-th layer, and by S_n the stride of K_n. The relationship between the receptive field, the stride and the kernel size can be derived from the receptive-field rule in Table 1.
The receptive field size of the feature map after the first layer of convolution is the size of the convolution kernel:
RF_1 = K_1    (2)

When the stride is 1, the receptive field of the n-th (n ≥ 2) feature map is:

RF_n = RF_{n-1} + (K_n - 1)    (3)

For the case where the stride is not 1 and n ≥ 2:

RF_n = RF_{n-1} + (K_n - 1) × S_n    (4)
(3) extracting features;
the invention utilizes a basic network model VGG-13 to extract the characteristics of the image.
First, define the position coordinate of the p-th joint point in the image pixels as Y_p; then,

Y_p ∈ Z ⊂ R²    (5)
where the set Z represents the position of all pixels in the image.
Suppose there are P joint points to be predicted; the goal is to obtain the coordinates Y of all P joint points:

Y = (Y_1, Y_2, …, Y_P)    (6)
from the above relationship, Y is a subset of Z.
A multi-stage prediction classifier g_t(x) is then defined to predict the position of every joint point at each stage. At each stage t ∈ {1, 2, …, T}, the prediction classifier assigns each point z in the image a confidence of being the location Y_p and generates a heat map for every gesture joint point; the specific expression is:
g_t(x_z) → {b_t^p(Y_p = z)},  p = 1, …, P    (7)
when the classifier predicts the gesture joint point location in the first stage, it generates a heat map and corresponding gesture joint point confidence scores:
g_1(x_z) → {b_1^p(Y_p = z)},  p = 1, …, P    (8)
where b_1^p(Y_p = z) is the confidence score, produced by the first-stage classifier, that the p-th gesture joint point is located at position z.
For each of the following stages, the confidence score for the p-th gesture joint at z-position may be expressed as:
b_t^p[u, v] = b_t^p(Y_p = z),  z = (u, v)    (9)
wherein u, v represent coordinate values of a position z in the image.
In each subsequent stage t (t ≥ 2), a more accurate position coordinate z is assigned to each gesture joint point on the basis of the previous stage's heat maps and gesture joint point confidence scores. This more accurate position z is determined from the image features extracted by the classifier in the first stage and the image context information extracted by the classifier in the previous stage. As before, the prediction classifier of each subsequent stage still generates the gesture joint point heat maps of that stage and the corresponding confidence scores:
g_t(x'_z, ψ_t(z, b_{t-1})) → {b_t^p(Y_p = z)},  p = 1, …, P    (10)
where ψ_t(z, b_{t-1}) denotes the mapping from the previous-stage confidence scores to image context information, and x'_z denotes the image features extracted around position z in the preceding stage.
Under the continuous repetition of the above processes, the position of the p-th gesture joint point is corrected in each stage based on the image context information in the previous stage and the image features extracted in the first stage, and the model finally estimates the more accurate coordinate position of the gesture joint point through the gradual fine adjustment process.
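For illustration, a minimal PyTorch-style sketch of this multi-stage refinement is given below: a first-stage head predicts one belief (heat) map per joint point from shared image features, and each later stage refines the previous belief maps together with the same features. The channel counts, kernel sizes and number of stages are illustrative assumptions rather than the exact configuration of the improved CPM model.

```python
import torch
import torch.nn as nn

P = 21          # number of gesture joint points (21 joints are selected in step two)
T_STAGES = 3    # number of refinement stages; the exact count is an assumption

class Stage1(nn.Module):
    """Predicts an initial belief map b_1^p for every joint from image features."""
    def __init__(self, feat_ch=128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(feat_ch, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, P, 1))                 # one belief-map channel per joint
    def forward(self, feats):
        return self.head(feats)

class StageT(nn.Module):
    """Refines the previous stage's belief maps using image features + context (t >= 2)."""
    def __init__(self, feat_ch=128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(feat_ch + P, 128, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(128, P, 1))
    def forward(self, feats, prev_belief):
        # psi_t(z, b_{t-1}) is realised here simply by concatenating the previous
        # belief maps with the shared image features before the refinement head.
        return self.head(torch.cat([feats, prev_belief], dim=1))

def multi_stage_beliefs(feats, stage1, later_stages):
    """Returns the list of belief maps produced at every stage."""
    beliefs = [stage1(feats)]
    for stage in later_stages:
        beliefs.append(stage(feats, beliefs[-1]))
    return beliefs

stage1, later = Stage1(), [StageT() for _ in range(T_STAGES - 1)]
feats = torch.randn(1, 128, 46, 46)               # stand-in for backbone features
maps = multi_stage_beliefs(feats, stage1, later)
print([m.shape for m in maps])                    # T_STAGES sets of 21 belief maps
```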
Step two, collecting single-viewpoint video data;
the invention collects gesture video samples in a single-viewpoint mode, namely a common network camera captures gesture data of a user from multiple angles, wherein:
(1) defining basic gesture elements;
the invention redefines the basic action elements recognized visually, and calls the determined specific recognizable basic action elements as basic gesture elements and defines the signs of the basic gesture elements.
(2) Selecting a gesture joint point;
the gesture joint points are identified, the identified joint points are connected and labeled in sequence to form a hand posture skeleton, the hand posture is identified by identifying the hand skeleton posture, and the process is defined as gesture estimation. When the fingers are bent, the fingers are usually divided into three small sections, so that the fingers present different bending degrees, and the connection points between the three sections are just the joint points of the fingers, therefore, the invention selects the point of the fingertip part of each finger as the initial joint point of the finger, then connects the joint points on the three small sections on each finger, then the tail joint point on each finger is connected with one joint point on the wrist, and after the joint points of the model are selected, the joint points of the model are labeled and connected according to a certain sequence to form a gesture framework.
(3) Preparing a training sample data set;
the basis for image or video content identification based on convolutional neural networks is a standardized data set. Therefore, the invention carries out video acquisition on the basic gesture elements under the single viewpoint so as to establish the basic gesture element database.
Meanwhile, an existing large data set is generally divided into a training set, a verification set and a test set. The three subsets have no intersection between every two subsets, the union of the three subsets is a complete set, and the three subsets are independently and identically distributed due to the fact that the three subsets are from the same data set. The verification set and the test set are used for testing the accuracy of the model, and both the verification set and the test set are irrelevant to the gradient descent process during model training, but due to the participation of the verification set, the verification result regulates the iteration number and the learning rate of the model, namely the model has a parameter adjustment process, so that the verification set is considered to participate in the model training.
Step three, outputting a gesture Gaussian heat map and gesture joint point coordinates;
the invention adopts a heat map form to label the real value of the gesture joint point, and simultaneously adopts the heat map as the output of the model, wherein the generated gesture joint point heat map takes a certain point of a pixel area where the joint point is located in an image as the center, takes the specific number of pixel points as the radius, draws a circular area, divides the area where the joint point is located out as a probability area where the joint point appears, the color is deepest in the center of the area, the probability of the joint point at the position is shown to be the maximum, and then the color of the area is gradually lightened from the center to the outside. This color will peak in the center and the image form that becomes lighter around resembles a gaussian image, so the gaussian can be used to generate a heat map for each joint area. The coordinates of the heatmap in the present invention are in the form of (x, y), i.e., a formula with a two-dimensional gaussian function:
f(x, y) = A·exp(−((x − x_0)² + (y − y_0)²) / (2σ²))    (11)

where x_0, y_0 denote the true coordinate values of the gesture joint point; x and y denote the coordinate values of a pixel in the joint point's heat-map region; A denotes the amplitude of the two-dimensional Gaussian function; and σ denotes the standard deviation in x and y.
For the size of the probability area of the gesture joint heat map, the invention defines the probability area as a circular area with the radius of 1, wherein the given value of the amplitude A of the two-dimensional Gaussian function is 1, and the given value of the sigma of the two-dimensional Gaussian function is 1.5, so that a distribution image of the two-dimensional Gaussian function is generated.
A heat map in the form of a two-dimensional Gaussian distribution is generated on top of the original picture. Based on the center coordinate of the gesture joint region, the heat map produces a Gaussian-distributed probability region: the probability value is largest at the center of the region, i.e. at the peak point of the two-dimensional Gaussian, and becomes smaller as it spreads outwards. In such a Gaussian probability region centered on the peak point, the sum of the values of all points is greater than 1, whereas the probabilities of all pixel positions at which the gesture joint point may appear should sum to 1. Therefore the function values of all pixels in the region are summed, and the function value of each pixel is divided by this sum, ensuring that the probabilities of all points sum to 1. The processing is as follows:
P(x, y) = f(x, y) / Σ f(x, y)    (12)

where P(x, y) is the normalized probability that the joint point is located at pixel (x, y); f(x, y) is the two-dimensional Gaussian function value of a pixel in the probability region; and Σ f(x, y) is the sum of the function values of all pixels in the region.
In the invention, the heat maps generated from the two-dimensional Gaussian function are called Gaussian heat maps; at every stage of the model, the Gaussian heat maps of all joint points are output, i.e. one Gaussian heat map per joint point.
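For illustration, a minimal NumPy sketch of the Gaussian heat-map construction of equations (11)-(12) follows, using the stated values A = 1, σ = 1.5 and a radius-1 probability region; the heat-map size and the example joint position are assumptions chosen only for the demonstration, and the second helper shows how a joint coordinate can be read back from a heat map by taking its peak.

```python
import numpy as np

def gaussian_heatmap(h, w, x0, y0, A=1.0, sigma=1.5, radius=1):
    """Heat map of equation (11), normalised as in equation (12) so that the
    probabilities inside the circular region around (x0, y0) sum to 1."""
    ys, xs = np.mgrid[0:h, 0:w]
    f = A * np.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2.0 * sigma ** 2))
    mask = (xs - x0) ** 2 + (ys - y0) ** 2 <= radius ** 2   # circular probability region
    f = f * mask
    return f / f.sum()                      # P(x, y) = f(x, y) / sum f(x, y)

def heatmap_to_coordinate(heatmap):
    """Joint coordinate = pixel with the highest probability (the Gaussian peak)."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return x, y

hm = gaussian_heatmap(46, 46, x0=20, y0=12)
print(heatmap_to_coordinate(hm), hm.sum())  # -> (20, 12) and a probability sum of 1.0
```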
Step four, constructing a gesture sequence recognition network;
the specific process of the network model construction is as follows:
(1) defining an activation function;
the number of layers of the recurrent neural network related to the invention is not large, and the problem of gradient disappearance is relatively small under the condition of not deep network layers, so Tanh is adopted as an activation function in the recurrent neural network.
The Tanh activation function is a hyperbolic tangent function, and the expression of Tanh and its derivatives is as follows:
tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x})    (13)

tanh′(x) = 1 − tanh²(x)    (14)
(2) selecting a loss function;
the method comprises the steps of outputting the category of basic gesture elements in the last layer of the network, calculating the probability that gestures in an input video respectively belong to each category by adopting a multi-category Softmax loss function, and outputting the category with the highest probability in each category as a gesture prediction result in the video by a model.
Assuming that x is a set of feature vectors input into the Softmax layer by the recurrent neural network, and W and b are parameters of Softmax, the first step of Softmax is to score each category, calculate the score value Logit of each category:
Logit = W^T·x + b    (15)
next, the score for each category is converted to a respective probability value using Softmax:
P_i = e^{Logit_i} / Σ_j e^{Logit_j}    (16)

where i denotes the i-th gesture category and Logit_i is the score of the i-th gesture.
The model outputs a probability distribution over the gesture categories; this predicted distribution is denoted q(x), and each gesture also has an actual label, i.e. a true probability distribution, denoted p(x). The loss function used with Softmax is the cross-entropy loss; cross entropy describes the distance between two probability distributions and can be defined as:
H(p, q) = −Σ p(x)·log q(x)    (22)

Assume that p(x) = (A, B, C) is the true distribution and q(x) = (u, v, w) is the predicted one; the cross entropy of p(x) as represented by q(x) is then:
H((A, B, C), (u, v, w)) = −(A·log u + B·log v + C·log w)    (23)

When q(x) and p(x) are interchanged, the cross entropy of the two is different. Cross entropy is measured in terms of probability: the more probable an event is, the less information it contains, i.e. the smaller its entropy. Therefore, the closer the predicted probability distribution q(x) is to the true distribution p(x), the smaller their cross entropy, which means that the output of the model is closer to the true value and the prediction of the model is more accurate.
(3) Establishing a model;

In the model, X = (x_1, x_2, x_3, …, x_T) is the sequence of video frames expanded in time order. These time-ordered frames are the input of the recurrent neural network, the information contained in each frame is the joint point coordinate values of the gesture, and the length of the sequence is T. The hidden states of the first hidden layer are H^{(1)} = (h_1^{(1)}, h_2^{(1)}, …, h_T^{(1)}); for the hidden state of the first hidden layer there is:

h_t^{(1)} = tanh(U·x_t + W·h_{t−1}^{(1)} + b)    (24)
wherein, the hidden state of the first sequence of the first hidden layer is:
h_1^{(1)} = tanh(U·x_1 + b)    (25)

For the second hidden layer, its input is determined by its own hidden state at the previous time step together with the hidden state of the first hidden layer at the current time step, so its hidden state can be expressed as:

h_t^{(2)} = tanh(U·h_t^{(1)} + W·h_{t−1}^{(2)} + b)    (26)
wherein, the hidden state of the first sequence of the second hidden layer is:
h_1^{(2)} = tanh(U·h_1^{(1)} + b)    (27)

The final output is the predicted classification result for each gesture, Y = (Y_1, Y_2, Y_3, Y_4, …, Y_n), with:

Y_i = Softmax(V·h_T + c)    (28)

where i = 1, 2, 3, 4, …, n; U, W and V are the parameter matrices used for the matrix transformations of the inputs and of the hidden-layer states; b and c are bias terms; and all parameters are shared across the stages of the network.
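For illustration, a minimal NumPy sketch of the forward pass defined by equations (24)-(28) follows. It assumes zero initial hidden states and, because U, W and b are shared by both hidden layers, an input dimension equal to the hidden dimension so that the shapes line up; the random parameter values are placeholders only.

```python
import numpy as np

def rnn_forward(X, U, W, V, b, c):
    """Forward pass of the two-hidden-layer recurrent network of equations (24)-(28).
    X has shape (T, d): one joint-coordinate feature vector per video frame."""
    h1 = np.zeros(U.shape[0])                    # assumed zero initial hidden states
    h2 = np.zeros(U.shape[0])
    for x_t in X:
        h1 = np.tanh(U @ x_t + W @ h1 + b)       # first hidden layer, eq. (24)
        h2 = np.tanh(U @ h1 + W @ h2 + b)        # second hidden layer, eq. (26)
    logits = V @ h2 + c                          # eq. (28) before the Softmax
    e = np.exp(logits - logits.max())
    return e / e.sum()                           # predicted gesture probabilities Y

rng = np.random.default_rng(0)
d, n_classes, T = 42, 5, 30                      # 21 joints x 2 coordinates, 5 gestures
U, W = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
V, b, c = rng.normal(size=(n_classes, d)) * 0.1, np.zeros(d), np.zeros(n_classes)
print(rnn_forward(rng.normal(size=(T, d)), U, W, V, b, c))
```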
And finally, inputting the joint point coordinates obtained in the step three into a standard gesture sequence recognition network to obtain a gesture action sequence.
The invention has the advantages that:
the invention provides a gesture recognition algorithm fusing a recurrent neural network based on a computer vision technology, and the gesture recognition algorithm is used for recognizing the gesture actions of staff in the production process. The outstanding characteristics are that: aiming at the problem that continuous complex actions are difficult to identify through a computer vision technology in actual production, a CPM model is improved, a gesture joint coordinate identification network model is established to obtain gesture joint coordinates of gesture video samples collected under a single viewpoint, the gesture joint coordinates are input into a corrected standard gesture sequence identification network, a gesture action sequence is obtained, and continuous actions are identified.
Drawings
FIG. 1 is a model structure of a VGG-13 of the present invention;
FIG. 2 is a schematic diagram of 21 selected gesture joints according to the present invention;
FIG. 3 is a schematic diagram of the gesture joint point labels and skeleton of the present invention;
4 a-4 e are screenshots of video samples of 5 basic gesture elements of the present invention; where fig. 4a is a hands-free movement, fig. 4b is a release or placement, fig. 4c is a rotation, fig. 4d is a load movement, fig. 4e is a grasping;
FIG. 5 is a two-dimensional Gaussian function distribution plot of the present invention;
FIG. 6 is a graph of Tanh activation function and its derivative function according to the present invention;
FIG. 7 is a schematic diagram of a recurrent neural network architecture of the present invention;
FIG. 8 is a schematic diagram of a recurrent neural network structure for five gesture classes in accordance with the present invention;
FIG. 9 is a gradient descent process of the minimization of loss function of the present invention;
FIG. 10 is a graph of the accuracy rate of the single viewpoint model for five basic gesture element recognition;
FIG. 11 is a flowchart of the deep learning based dynamic gesture recognition method of the present invention.
Detailed Description
The technical scheme of the invention is further explained by combining the attached drawings.
Based on the above problems, the invention provides a gesture action recognition method based on computer vision: the CPM model is first improved to construct a gesture joint point coordinate recognition network model, a gesture video is then collected under a single viewpoint, and the collected video is fed into the standard gesture joint point coordinate recognition network to obtain gesture Gaussian heat maps and joint point coordinates. The joint point coordinates are then input into the standard gesture sequence recognition network to obtain a gesture action sequence, and recognition of continuous actions is finally realized.
In order to verify the feasibility and the superiority of the method provided by the invention, five basic gestures are selected for verification and test, and the method comprises the following steps:
step one, constructing a gesture joint point coordinate identification network;
the invention utilizes an improved CPM model to process a gesture video and output the coordinates of gesture joint points under a single viewpoint, and the realization process comprises the following steps:
(1) selecting a basic network model for gesture joint point estimation;
the method selects VGG-13 as a basic network model for gesture joint point estimation, wherein the VGG-13 is composed of 5 groups of convolution groups, 5 pooling groups, 3 full connections and 1 softmax classification layer.
(2) Setting a receptive field;
the size of the receptive field is related to the sliding window of convolution or pooling, and both are considered as a map, compressing the k × k range of pixel values on the n-layer feature map into one pixel on the n + 1-layer feature map, denoted as fksWhere s represents the step size of the sliding window, k represents the size of the convolution kernel or pooling kernel, and the mapping relationship is:
Figure BDA0002356368640000071
wherein: x is the number ofn,xn+1Characteristic diagram of the n-th layer and the n + 1-th layer。
The basic network structure of the invention is based on VGG-13. The first part of VGG-13 contains two convolutions and one pooling, and these three structures form a cascade, so the mapping process is repeated many times in the network to form a multi-level mapping. Consider a 6 × 6 region of an original image. In this first part there are two convolution layers, each with a 3 × 3 kernel and stride 1, and one pooling layer with a 2 × 2 pooling kernel and stride 2. For the feature map output by the first convolution layer, since the kernel size is 3 × 3, the receptive field of a pixel of this feature map on the original image is 3 × 3. For the feature map output by the second convolution layer, the kernel size is still 3 × 3, so the receptive field of a pixel of the second feature map on the first feature map is also 3 × 3; pushing this 3 × 3 region of the first feature map back to the original image, and using the relationship between the first layer's receptive field and the original image, the 3 × 3 region of the first feature map corresponds to a 5 × 5 region of the original image, i.e. the receptive field of the second convolution layer's output on the original image is 5 × 5. For the feature map of the final pooling layer, a single output pixel corresponds to a 2 × 2 receptive field on the second feature map, which corresponds to a 4 × 4 region on the first feature map and, pushed back once more, to a 6 × 6 region on the original image; that is, the receptive field of the feature map output by the final pooling layer with respect to the original image is 6 × 6. The receptive field and convolution or pooling kernel parameters of each link are shown in Table 1, where the receptive field of the original image itself is 1 × 1:
TABLE 1 Receptive field and convolution/pooling kernel parameters of each layer's feature map under the cascade

Layer            Kernel size K_n   Stride S_n   Receptive field RF_n
Original image   -                 -            1 × 1
Convolution 1    3 × 3             1            3 × 3
Convolution 2    3 × 3             1            5 × 5
Pooling          2 × 2             2            6 × 6
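For illustration, a short Python sketch follows that reproduces the receptive-field column of Table 1 from the kernel sizes and strides, using the recurrence of formulas (2)-(4) below; the multiplier applied to (K_n − 1) is taken here as the cumulative stride of the preceding layers, which is the reading that yields the 3 × 3, 5 × 5 and 6 × 6 values above.

```python
def receptive_fields(layers):
    """Receptive field of each feature map with respect to the original image.
    `layers` is a list of (kernel_size, stride) pairs; the original image has RF = 1."""
    rf, jump, out = 1, 1, []
    for k, s in layers:
        rf = rf + (k - 1) * jump   # RF_n = RF_{n-1} + (K_n - 1) * (cumulative stride)
        jump *= s                  # cumulative stride of the layers processed so far
        out.append(rf)
    return out

# the cascade of the first part of VGG-13: conv 3x3/1, conv 3x3/1, pool 2x2/2
print(receptive_fields([(3, 1), (3, 1), (2, 2)]))   # -> [3, 5, 6]
```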
Denote by RF_n the receptive field of the n-th feature map, by K_n the size of the convolution or pooling kernel of the n-th layer, and by S_n the stride of K_n. The relationship between the receptive field, the stride and the kernel size can be derived from the receptive-field rule in Table 1.
The receptive field size of the feature map after the first layer of convolution is the size of the convolution kernel:
RF_1 = K_1    (2)

When the stride is 1, the receptive field of the n-th (n ≥ 2) feature map is:

RF_n = RF_{n-1} + (K_n - 1)    (3)

For the case where the stride is not 1 and n ≥ 2:

RF_n = RF_{n-1} + (K_n - 1) × S_n    (4)
if the design of the cascade structure is changed into a single convolution layer, the equivalent receptive field can also be achieved, the size of the convolution kernel at this time is 6 × 6, the step length is 1, and according to the formula (2), the receptive field of the output feature map after the convolution of the first layer is equal to the size of the convolution kernel, namely 6 × 6. The VGG-13 is selected as the basic network structure in the invention, because the utilization of the receptive field structure by the VGG-13, namely, two convolutions and a pooled cascade structure are used to replace a convolution of 6 x 6, the following advantages are achieved: 1) reducing the network parameters; 2) the nonlinear structure of the network is reinforced.
(3) Extracting features;
the invention utilizes a basic network model VGG-13 to extract the characteristics of the image.
First, define the position coordinate of the p-th joint point in the image pixels as Y_p; then,

Y_p ∈ Z ⊂ R²    (5)
where the set Z represents the position of all pixels in the image.
Suppose there are P joint points to be predicted; the goal is to obtain the coordinates Y of all P joint points:

Y = (Y_1, Y_2, …, Y_P)    (6)
from the above relationship, Y is a subset of Z.
A multi-stage prediction classifier g_t(x) is then defined to predict the position of every joint point at each stage. At each stage t ∈ {1, 2, …, T}, the prediction classifier assigns each point z in the image a confidence of being the location Y_p and generates a heat map for every gesture joint point; the specific expression is:
g_t(x_z) → {b_t^p(Y_p = z)},  p = 1, …, P    (7)
when the classifier predicts the gesture joint point location in the first stage, it generates a heat map and corresponding gesture joint point confidence scores:
g_1(x_z) → {b_1^p(Y_p = z)},  p = 1, …, P    (8)
where b_1^p(Y_p = z) is the confidence score, produced by the first-stage classifier, that the p-th gesture joint point is located at position z.
For each of the following stages, the confidence score for the p-th gesture joint at z-position may be expressed as:
b_t^p[u, v] = b_t^p(Y_p = z),  z = (u, v)    (9)
wherein u, v represent coordinate values of a position z in the image.
In each subsequent stage t (t ≥ 2), a more accurate position coordinate z is assigned to each gesture joint point on the basis of the previous stage's heat maps and gesture joint point confidence scores. This more accurate position z is determined from the image features extracted by the classifier in the first stage and the image context information extracted by the classifier in the previous stage. As before, the prediction classifier of each subsequent stage still generates the gesture joint point heat maps of that stage and the corresponding confidence scores:
g_t(x'_z, ψ_t(z, b_{t-1})) → {b_t^p(Y_p = z)},  p = 1, …, P    (10)
where ψ_t(z, b_{t-1}) denotes the mapping from the previous-stage confidence scores to image context information, and x'_z denotes the image features extracted around position z in the preceding stage.
Under the continuous repetition of the above processes, the position of the p-th gesture joint point is corrected in each stage based on the image context information in the previous stage and the image features extracted in the first stage, and the model finally estimates the more accurate coordinate position of the gesture joint point through the gradual fine adjustment process.
Step two, collecting single-viewpoint video data;
the invention collects gesture video samples in a single-viewpoint mode, namely a common network camera captures gesture data of a user from multiple angles, wherein:
(1) defining basic gesture elements;
the invention adjusts basic action elements of visual recognition on the basis of a model method, eighteen kinds of kinematical elements and the like, redefines action recognition elements, determines 5 specifically recognizable basic action elements, is called as basic gesture elements, namely, Empty hand movement, load movement, rotation, grabbing, releasing or placing, defines symbols thereof, and respectively represents Empty Move, Turn, Grasp and Release, and is specifically shown in table 2:
TABLE 2 basic gesture element Table
(2) Selecting a gesture joint point;
the invention realizes the recognition of the posture of the hand by recognizing gesture joint points and connecting the recognized joint points in sequence to form a skeleton of the posture of the hand, and defines the process as gesture estimation.
When a finger bends, it can be seen to consist of three small segments, which is what gives the finger its different degrees of bending, and the connection points between these segments are exactly the joint points of the finger. The invention therefore selects the fingertip of each finger as its initial joint point, then connects the joint points of the three segments of each finger, and finally connects the last joint point of each finger to a joint point at the wrist to form the skeleton of the hand posture; in total, 21 gesture joint points are selected.
After the joint points of the model are selected, they are labeled and connected in a fixed order to form the gesture skeleton. The joint point at the wrist is the final connection point of every finger, so it is labeled number 1 and serves as the starting point of the gesture joint points. Then, according to the spatial distance of the joint points, the four joint points of the thumb are labeled 2, 3, 4 and 5 from bottom to top, the fingertip being the end of each finger, and each remaining finger is labeled in the same bottom-to-top order.
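For illustration, the sketch below encodes one possible numbering of the 21 joint points and the skeleton connections; only the wrist (number 1) and the thumb (numbers 2-5) are fixed by the description above, so the numbering of the remaining fingers is an assumption that simply follows the same bottom-to-top rule.

```python
# Joint numbering assumed from the description: 1 = wrist, then each finger is
# numbered bottom-to-top, with the fingertip as the last, outermost joint.
JOINTS = {1: "wrist"}
FINGERS = ["thumb", "index", "middle", "ring", "little"]
for f_idx, name in enumerate(FINGERS):
    base = 2 + 4 * f_idx                      # thumb: 2-5, index: 6-9, ...
    for j in range(4):
        JOINTS[base + j] = f"{name}_{j + 1}"  # joint 4 of each finger is the fingertip

# Skeleton edges: wrist to the first joint of every finger, then along the finger.
BONES = []
for f_idx in range(5):
    base = 2 + 4 * f_idx
    BONES.append((1, base))
    BONES += [(base + j, base + j + 1) for j in range(3)]

assert len(JOINTS) == 21 and len(BONES) == 20
```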
(3) Preparing a training sample data set;
the basis for image or video content identification based on convolutional neural networks is a standardized data set. Since the invention needs to recognize specific 5 basic gesture elements, a sample data set of short video gesture elements with 5 basic gesture elements as the standard is established.
Video acquisition is carried out on 5 basic gesture elements under a single view point, 500 short videos of 1-2 seconds are acquired by each gesture and are completed by 20 different people, each person shoots 50 short videos by each gesture, and 5000 gesture short videos are obtained in total so as to establish a basic gesture element database.
For an existing large data set, if training of a supervised learning model is to be completed and accuracy of the supervised learning model is to be tested, the large data set is usually divided into a training set, a verification set and a test set according to a certain proportion, such as 8:1: 1. The three subsets have no intersection between every two subsets, the union of the three subsets is a complete set, and the three subsets are independently and identically distributed due to the fact that the three subsets are from the same data set. Although the verification set and the test set are both used for testing the accuracy of the model and are not related to the gradient descent process during model training, due to the participation of the verification set, the verification result regulates the iteration number and the learning rate of the model, namely the model has a parameter adjustment process, so that the verification set is considered to participate in the training of the model.
Based on the division rule of the data set, under the condition that the samples are independently and uniformly distributed, 5000 video samples are divided into a training set, a verification set and a test set according to the ratio of 8:1:1 in a uniform random sampling mode. The sample compositions of the divided training set, validation set and test set are shown in table 3, table 4 and table 5 below.
TABLE 3 basic gesture element training set sample composition
TABLE 4 basic gesture element verification set sample composition
TABLE 5 basic gesture element test set sample composition
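For illustration, a short Python sketch of the uniform random 8:1:1 split into the training, validation and test sets listed above follows; the clip file names and the per-gesture count are assumptions chosen only so that the total equals the 5000 collected samples.

```python
import random

def split_8_1_1(samples, seed=0):
    """Uniform random 8:1:1 split into training, validation and test sets."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n = len(samples)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])

# illustrative file names; the real database contains the collected short videos
clips = [f"gesture_{g}_{i:04d}.mp4" for g in range(5) for i in range(1000)]
train, val, test = split_8_1_1(clips)
print(len(train), len(val), len(test))   # -> 4000 500 500
```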
Step three, outputting a gesture Gaussian heat map and gesture joint point coordinates;
the invention adopts a heat map form to label the real value of the gesture joint point, and simultaneously adopts the heat map as the output of the model, wherein the generated gesture joint point heat map takes a certain point of a pixel area where the joint point is located in an image as the center, takes the specific number of pixel points as the radius, draws a circular area, divides the area where the joint point is located out as a probability area where the joint point appears, the color is deepest in the center of the area, the probability of the joint point at the position is shown to be the maximum, and then the color of the area is gradually lightened from the center to the outside. This color will peak in the center and the image form that becomes lighter around resembles a gaussian image, so the gaussian can be used to generate a heat map for each joint area. The coordinates of the heatmap in the present invention are in the form of (x, y), i.e., a formula with a two-dimensional gaussian function:
f(x, y) = A·exp(−((x − x_0)² + (y − y_0)²) / (2σ²))    (11)

where x_0, y_0 denote the true coordinate values of the gesture joint point; x and y denote the coordinate values of a pixel in the joint point's heat-map region; A denotes the amplitude of the two-dimensional Gaussian function; and σ denotes the standard deviation in x and y.
For the size of the probability area of the gesture joint heat map, the invention defines the probability area as a circular area with the radius of 1, wherein the given value of the amplitude A of the two-dimensional Gaussian function is 1, and the given value of the sigma of the two-dimensional Gaussian function is 1.5, so that a distribution image of the two-dimensional Gaussian function is generated.
A heat map in the form of a two-dimensional Gaussian distribution is generated on top of the original picture. Based on the center coordinate of the gesture joint region, the heat map produces a Gaussian-distributed probability region: the probability value is largest at the center of the region, i.e. at the peak point of the two-dimensional Gaussian, and becomes smaller as it spreads outwards. In such a Gaussian probability region centered on the peak point, the sum of the values of all points is greater than 1, whereas the probabilities of all pixel positions at which the gesture joint point may appear should sum to 1. Therefore the function values of all pixels in the region are summed, and the function value of each pixel is divided by this sum, ensuring that the probabilities of all points sum to 1. The processing is as follows:
P(x, y) = f(x, y) / Σ f(x, y)    (12)

where P(x, y) is the normalized probability that the joint point is located at pixel (x, y); f(x, y) is the two-dimensional Gaussian function value of a pixel in the probability region; and Σ f(x, y) is the sum of the function values of all pixels in the region.
In the invention, the heat maps generated from the two-dimensional Gaussian function are called Gaussian heat maps; at every stage of the model, the Gaussian heat maps of all joint points are output, i.e. one Gaussian heat map per joint point.
Step four, constructing a gesture sequence recognition network;
the specific process of the network model construction is as follows:
(1) defining an activation function;
the number of layers of the recurrent neural network related to the invention is not large, and the problem of gradient disappearance is relatively small under the condition of not deep network layers, so Tanh is adopted as an activation function in the recurrent neural network.
The Tanh activation function is a hyperbolic tangent function, and the expression of Tanh and its derivatives is as follows:
tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x})    (13)

tanh′(x) = 1 − tanh²(x)    (14)
(2) selecting a loss function;
the method comprises the steps of outputting the category of basic gesture elements in the last layer of the network, calculating the probability that gestures in an input video respectively belong to each category by adopting a multi-category Softmax loss function, and outputting the category with the highest probability in each category as a gesture prediction result in the video by a model.
Assuming that x is a set of feature vectors input into the Softmax layer by the recurrent neural network, and W and b are parameters of Softmax, the first step of Softmax is to score each category, calculate the score value Logit of each category:
Logit = W^T·x + b    (15)
next, Softmax converts the score of each category into a respective probability value, and assuming that the scores of the five gesture categories are (c, d, e, f, g), the formula for Softmax converting the scores into the probability values can be expressed as:
P_i = e^{Logit_i} / Σ_j e^{Logit_j}    (16)

where i denotes the i-th gesture category and Logit_i is the score of the i-th gesture. The probabilities of the five gesture categories can then be expressed as:
P_1 = e^c / (e^c + e^d + e^e + e^f + e^g)    (17)

P_2 = e^d / (e^c + e^d + e^e + e^f + e^g)    (18)

P_3 = e^e / (e^c + e^d + e^e + e^f + e^g)    (19)

P_4 = e^f / (e^c + e^d + e^e + e^f + e^g)    (20)

P_5 = e^g / (e^c + e^d + e^e + e^f + e^g)    (21)
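Before turning to the loss function, a short numerical sketch of equations (16)-(21), together with the cross-entropy of equation (22) defined below, is given here; the five score values and the one-hot label are assumptions chosen only for the demonstration.

```python
import numpy as np

def softmax(scores):
    """Converts the five gesture scores (c, d, e, f, g) into probabilities P_1..P_5,
    as in equations (16)-(21); subtracting the maximum keeps exp() numerically stable."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def cross_entropy(p_true, q_pred, eps=1e-12):
    """H(p, q) = -sum p(x) log q(x), the cross-entropy loss of equation (22)."""
    return -np.sum(p_true * np.log(q_pred + eps))

scores = np.array([2.1, 0.3, -1.0, 0.8, 0.0])   # hypothetical scores for the 5 gestures
q = softmax(scores)                              # predicted distribution q(x)
p = np.array([1.0, 0.0, 0.0, 0.0, 0.0])          # one-hot true distribution p(x)
print(q.round(3), cross_entropy(p, q))
```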
the model thus far outputs a probability distribution of five gesture classes, which is a predicted value and is referred to as q (x), and the gesture also carries an actual label, i.e. a true probability distribution, which is referred to as p (x). Since the Softmax function is also called cross-entropy loss function, while cross-entropy describes the distance problem between two probability distributions, it can be defined as:
H(p, q) = −Σ p(x)·log q(x)    (22)

Assume that p(x) = (A, B, C) is the true distribution and q(x) = (u, v, w) is the predicted one; the cross entropy of p(x) as represented by q(x) is then:
H((A, B, C), (u, v, w)) = −(A·log u + B·log v + C·log w)    (23)

When q(x) and p(x) are interchanged, the cross entropy of the two is different. Cross entropy is measured in terms of probability: the more probable an event is, the less information it contains, i.e. the smaller its entropy. Therefore, the closer the predicted probability distribution q(x) is to the true distribution p(x), the smaller their cross entropy, which means that the output of the model is closer to the true value and the prediction of the model is more accurate.
(3) Establishing a model;

In the model, X = (x_1, x_2, x_3, …, x_T) is the sequence of video frames expanded in time order. These time-ordered frames are the input of the recurrent neural network, the information contained in each frame is the joint point coordinate values of the gesture, and the length of the sequence is T. The hidden states of the first hidden layer are H^{(1)} = (h_1^{(1)}, h_2^{(1)}, …, h_T^{(1)}); for the hidden state of the first hidden layer there is:

h_t^{(1)} = tanh(U·x_t + W·h_{t−1}^{(1)} + b)    (24)
wherein, the hidden state of the first sequence of the first hidden layer is:
h_1^{(1)} = tanh(U·x_1 + b)    (25)

For the second hidden layer, its input is determined by its own hidden state at the previous time step together with the hidden state of the first hidden layer at the current time step, so its hidden state can be expressed as:

h_t^{(2)} = tanh(U·h_t^{(1)} + W·h_{t−1}^{(2)} + b)    (26)
wherein, the hidden state of the first sequence of the second hidden layer is:
h_1^{(2)} = tanh(U·h_1^{(1)} + b)    (27)

The final output is the predicted classification result of the five gestures, Y = (Y_1, Y_2, Y_3, Y_4, Y_5), with:

Y_i = Softmax(V·h_T + c)    (28)

where i = 1, 2, 3, 4, 5; U, W and V are the parameter matrices used for the matrix transformations of the inputs and of the hidden-layer states; b and c are bias terms; and all parameters are shared across the stages of the network.
And finally, inputting the joint point coordinates obtained in the step three into a standard gesture sequence recognition network to obtain a gesture action sequence.
(4) Updating a model based on a gradient descent method;
the neural network reversely propagates the loss function of the output layer back to the network by utilizing a gradient descent algorithm, and the contribution rate of the parameters to the loss is obtained, so that the parameters in the network are updated layer by layer. The gradient is the derivative in the differentiation, and the parameters of the loss function in the actual model are multivariate, so the partial derivative needs to be calculated on the parameters of the multivariate function, and the gradient is defined as:
∇J(θ_1, θ_2, …, θ_n) = (∂J/∂θ_1, ∂J/∂θ_2, …, ∂J/∂θ_n)    (29)

For this minimization problem, the principle of the gradient descent method is as follows: for the loss function J(θ_1, θ_2, …, θ_n), each parameter is updated by a step size, also called the learning rate, along the direction in which its gradient decreases the fastest, so that the value of the loss function decreases the fastest. The parameter update can be represented as the following process:
1) one direction of gradient descent is selected as the direction of the minimization loss function. The selected fastest gradient descent direction is the gradient maximum direction of a certain parameter:
Δ_θ J(θ_i) = ∂J(θ_i)/∂θ_i    (30)
2) the step size of the gradient descent, i.e., the learning rate η, is selected.
3) And (3) adjusting and updating the weight:
θi←θi-η·ΔθJ(θi) (31)
the gradient is propagated forwards layer by layer according to the processes to form a chain type derivation process, each layer of parameters are updated according to the three steps each time until the model training is finished, and the optimal solution is found.
(5) Training a model;
the invention inputs a video sequence, wherein the video sequence is a frame sequence arranged according to a time sequence, so that the input in each state is a video frame input at each moment. For a frame sequence of time length T, there is a loss function L at each instant(t)Then the sum of the losses at all times constitutes the total loss function:
L = Σ_{t=1}^{T} L^{(t)}    (32)
the input video is then predictively classified to be as consistent as possible with the given true label, and therefore a process is performed to bring the predicted value as close as possible to the true value, i.e. to minimize the loss function. Parameters in the network are updated in order to minimize the loss function. The output of each time sequence is o(t)Loss of L(t)Is formed by(t)Converted by the Softmax function, therefore, during the gradient back propagation of the loss function, the output o needs to be processed(t)The gradient is calculated by parameters V and c contained in the formula (1), and the parameters are respectively:
∂L/∂V = Σ_{t=1}^{T} (ŷ^{(t)} − y^{(t)})·(h^{(t)})^T    (33)

∂L/∂c = Σ_{t=1}^{T} (ŷ^{(t)} − y^{(t)})    (34)

For the loss of a single time step, only the gradients of V and c at that step need to be computed. The gradients of the parameters W, U and b, however, all depend on the gradient of the hidden layer, and from the structure of the recurrent neural network the gradient of the hidden layer at time t is related not only to the loss of the current time step but also to the loss at time t + 1. The gradient of the hidden layer at time t is therefore defined first and denoted δ^{(t)}:
δ^{(t)} = ∂L/∂h^{(t)}    (35)
Then, because the gradient of the hidden layer at time t is determined by two time loss functions, the true gradient is the sum of the partial derivative of the hidden layer by the time loss function at time t and the partial derivative of the hidden layer by the time loss function at time t +1, that is:
δ^{(t)} = V^T·(ŷ^{(t)} − y^{(t)}) + W^T·diag(1 − (h^{(t+1)})²)·δ^{(t+1)}    (36)

For the last time step T, the sequence is already in its final stage, so the gradient of its hidden layer is no longer affected by the gradient of the loss at a later moment, and the gradient of the last step can be expressed as:
δ^{(T)} = V^T·(ŷ^{(T)} − y^{(T)})    (37)

Gradient calculation can then be performed for the parameters W, U and b. For W, the gradient is:
∂L/∂W = Σ_{t=1}^{T} diag(1 − (h^{(t)})²)·δ^{(t)}·(h^{(t−1)})^T    (38)
for U, the gradient is:
∂L/∂U = Σ_{t=1}^{T} diag(1 − (h^{(t)})²)·δ^{(t)}·(x^{(t)})^T    (39)
for b, there is a gradient:
∂L/∂b = Σ_{t=1}^{T} diag(1 − (h^{(t)})²)·δ^{(t)}    (40)
by repeating the back propagation process, the parameter values are continuously updated, the purpose of loss function optimization is achieved, the model is finally converged, and a better gesture classification accuracy is achieved.
(6) Analyzing the experimental result;
the experimental development environment of the present invention is shown in tables 6 and 7 below, where table 6 lists the hardware environment of the experimental computer, table 7 lists the experimental development environment including specific contents such as development language and development framework, and table 8 lists the parameters of the model.
TABLE 6 Experimental computer configuration
TABLE 7 Experimental development Environment
TABLE 8 training parameters
The video samples collected under the single viewpoint are used for training, and the single-viewpoint video data are divided according to the 8:1:1 training/validation/test ratio mentioned above. In the invention, the labels of the 5 gesture samples are set as: empty-hand movement, load movement, rotation, grasping and releasing. The model is then trained with the parameters set in Table 8: the initial learning rate is 0.001 and the learning-rate decay rate is 0.94, and back-propagation training is performed with the gradient descent method. As the number of training iterations increases, the parameters of the model come closer to the real situation and the learning rate is decayed, with a minimum learning rate of 0.0001 after decay. The video frames read during training have a size of 408 × 720, and the video length is between 1 and 2 seconds, so the number of frames read each time varies; the value of the loss function is output once each iteration step is completed. As training proceeds, the loss function decreases continuously and the accuracy of the model increases continuously; the model then becomes stable and finally converges.
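For illustration, a short sketch of an exponentially decayed learning rate with the stated initial value 0.001, decay rate 0.94 and floor 0.0001 follows; the decay interval (how many steps between decays) is not given in the text and is an assumed value.

```python
def decayed_learning_rate(step, initial_lr=0.001, decay_rate=0.94,
                          decay_steps=1000, min_lr=0.0001):
    """Exponentially decayed learning rate with the minimum value described above.
    `decay_steps` controls how often the rate is decayed and is an assumption."""
    lr = initial_lr * decay_rate ** (step / decay_steps)
    return max(lr, min_lr)

for step in (0, 5000, 20000, 60000):
    print(step, round(decayed_learning_rate(step), 6))
```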
The invention has the advantages that:
the invention provides a gesture recognition algorithm fusing a recurrent neural network based on a computer vision technology, and the gesture recognition algorithm is used for recognizing the gesture actions of staff in the production process. The outstanding characteristics are that: aiming at the problem that continuous complex actions are difficult to identify through a computer vision technology in actual production, a CPM model is improved, a gesture joint coordinate identification network model is established to obtain gesture joint coordinates of gesture video samples collected under a single viewpoint, the gesture joint coordinates are input into a corrected standard gesture sequence identification network, a gesture action sequence is obtained, and continuous actions are identified.
The embodiments described in this specification are merely illustrative of implementations of the inventive concept and the scope of the present invention should not be considered limited to the specific forms set forth in the embodiments but rather by the equivalents thereof as may occur to those skilled in the art upon consideration of the present inventive concept.

Claims (1)

1. A dynamic gesture action recognition method based on deep learning comprises the following steps:
step one, constructing a gesture joint point coordinate identification network;
the method comprises the following steps of processing a gesture video by using an improved CPM model, and outputting gesture joint point coordinates under a single view point, wherein the implementation process comprises the following steps:
(1.1) selecting a base network model for gesture joint point estimation;
selecting VGG-13 as a basic network model for gesture joint point estimation;
(1.2) setting a receptive field;
the size of the receptive field is related to the sliding window of convolution or pooling, and both are considered as a map, compressing the k × k range of pixel values on the n-layer feature map into one pixel on the n + 1-layer feature map, denoted as fksWhere s represents the step size of the sliding window, k represents the size of the convolution kernel or pooling kernel, and the mapping relationship is:
x_{n+1} = f_{ks}(x_n) (1)
wherein x_n and x_{n+1} denote the feature maps of the n-th and (n+1)-th layers;
the basic network structure is based on VGG-13; the first part of VGG-13 contains two convolutions and one pooling, and these three structures form a cascade, so this mapping process is repeated many times in the network to form a multi-level mapping; the receptive field and convolution/pooling kernel parameters of each link are shown in Table 1:
Table 1. Receptive field and convolution/pooling kernel parameters corresponding to each layer's feature map in the cascade
[Table 1 appears as an image in the original claims; it lists, for each cascaded layer, the receptive field RF_n together with the corresponding kernel size K_n and stride S_n.]
where RF_n is the receptive field of the n-th feature map, K_n is the size of the convolution or pooling kernel of the n-th convolution layer, and S_n is the stride of K_n; the relationship between the receptive field, the stride and the kernel size can be derived according to the receptive-field rule in Table 1;
the receptive field size of the feature map after the first layer of convolution is the size of the convolution kernel:
RF_1 = K_1 (2)
when the stride is 1, the receptive field of the n-th (n ≥ 2) feature map is:
RF_n = RF_{n-1} + (K_n - 1) (3)
for the case where the stride is not 1 (n ≥ 2):
RF_n = RF_{n-1} + (K_n - 1) × S_n (4)
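As an informal illustration of recursions (2)-(4), the short Python sketch below computes the receptive field layer by layer from lists of kernel sizes and strides; the example layer parameters are hypothetical and are not the VGG-13 values of Table 1.

```python
# Sketch of the receptive-field recursion in equations (2)-(4):
# RF_1 = K_1; RF_n = RF_{n-1} + (K_n - 1) for stride 1,
# and RF_n = RF_{n-1} + (K_n - 1) * S_n otherwise.

def receptive_fields(kernels, strides):
    rfs = []
    for n, (k, s) in enumerate(zip(kernels, strides), start=1):
        if n == 1:
            rf = k                          # equation (2)
        elif s == 1:
            rf = rfs[-1] + (k - 1)          # equation (3)
        else:
            rf = rfs[-1] + (k - 1) * s      # equation (4)
        rfs.append(rf)
    return rfs

# Hypothetical cascade: conv3x3/1, conv3x3/1, pool2x2/2 (not the Table 1 values).
print(receptive_fields([3, 3, 2], [1, 1, 2]))  # -> [3, 5, 7]
```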
(1.3) extracting features;
extracting the features of the image by using a basic network model VGG-13;
first, the pixel position coordinate of the p-th joint point in the image is defined as Y_p; then:
Y_p ∈ Z ⊂ R^2 (5)
wherein the set Z represents the positions of all pixels in the image;
assuming P joint points are to be predicted, the goal is to obtain the coordinates Y of all P joint points:
Y = (Y_1, Y_2, …, Y_P) (6)
from the above relationship, Y is a subset of Z;
then a multi-stage prediction classifier g_t(x) is defined to predict the position of each joint point at each stage; at each stage t ∈ {1, 2, …, T}, the prediction classifier assigns a point z in the image to Y_p and generates a heat map for each gesture joint point; the specific expression is:
g_t(x_z) → {b_t^p(Y_p = z)}, p = 1, 2, …, P (7)
when the classifier predicts the gesture joint point location in the first stage, it generates a heat map and corresponding gesture joint point confidence scores:
g_1(x_z) → {b_1^p(Y_p = z)}, p = 1, 2, …, P (8)
wherein b_1^p(Y_p = z) is the confidence score of the classifier when predicting the p-th gesture joint point at position z in the first stage;
for each of the following stages, the confidence score for the p-th gesture joint at z-position may be expressed as:
b_t^p[u, v] = b_t^p(Y_p = z), z = (u, v)^T (9)
wherein u and v represent coordinate values of a certain position z in the image;
in a subsequent stage t (t ≥ 2), a more accurate position coordinate z is continually assigned to each gesture joint point based on the heat maps and joint confidence scores of the previous stage; this more accurate position z is determined from the image features extracted by the classifier in the first stage and the image context information extracted by the classifier in the previous stage; similarly, at each subsequent stage the prediction classifier still generates the heat map of each gesture joint point and the corresponding confidence score:
g_t(x'_z, ψ_t(z, b_{t-1})) → {b_t^p(Y_p = z)}, p = 1, 2, …, P (10)
wherein ψ_t(z, b_{t-1}) represents a mapping between the confidence scores and the image context information, and x'_z represents the image features extracted by the previous stage around position z;
as these processes are repeated, the position of the p-th gesture joint point is corrected at each stage based on the image context information from the previous stage and the image features extracted in the first stage; through this gradual fine-tuning process the model finally estimates a more accurate coordinate position for each gesture joint point;
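The following sketch illustrates this multi-stage refinement idea, assuming PyTorch as the framework; the channel sizes, the number of joints and the number of stages are illustrative assumptions and not the parameters of the improved CPM described here.

```python
# Sketch of multi-stage belief-map refinement in the spirit of a CPM:
# stage 1 predicts P joint heat maps from image features; every later stage
# re-predicts them from the same features concatenated with the previous
# stage's heat maps (the "context"). Channel sizes and stage count are
# illustrative assumptions.
import torch
import torch.nn as nn

class MultiStageHeatmaps(nn.Module):
    def __init__(self, feat_ch=128, num_joints=21, stages=3):
        super().__init__()
        self.stage1 = nn.Sequential(
            nn.Conv2d(feat_ch, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, num_joints, 1))
        self.refine = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(feat_ch + num_joints, 128, 3, padding=1), nn.ReLU(),
                nn.Conv2d(128, num_joints, 1))
            for _ in range(stages - 1)])

    def forward(self, features):
        beliefs = [self.stage1(features)]
        for stage in self.refine:
            prev = beliefs[-1]
            beliefs.append(stage(torch.cat([features, prev], dim=1)))
        return beliefs  # one heat-map tensor per stage

feats = torch.randn(1, 128, 46, 46)          # hypothetical backbone feature map
maps = MultiStageHeatmaps()(feats)
print([m.shape for m in maps])
```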
step two, collecting single-viewpoint video data;
the gesture video samples are collected in a single-view mode, namely, a common network camera is used for capturing gesture data of a user from multiple angles, wherein:
(2.1) defining a basic gesture element;
the basic action elements to be recognized visually are redefined; the determined, specifically recognizable basic action elements are called basic gesture elements, and their signs are defined;
(2.2) selecting a gesture joint point;
the gesture joint points are identified, and the identified joint points are connected and labeled in order to form a hand gesture skeleton; the gesture of the hand is recognized by recognizing the posture of this hand skeleton, a process defined as gesture estimation. When a finger bends, it is usually divided into three small segments that present different degrees of bending, and the connection points between these segments are exactly the joint points of the finger. Therefore, the point at the fingertip of each finger is selected as the initial joint point of that finger, the joint points on the three segments of each finger are then connected, and the last joint point of each finger is connected to a joint point on the wrist. After the joint points of the model are selected, they are labeled and connected in a certain order to form the gesture skeleton;
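One possible labelling of such a gesture skeleton (one wrist point plus four joint points per finger, 21 points in total) is sketched below; the exact indices used by the invention are not given in the text, so this numbering and connectivity are only an assumption.

```python
# Hypothetical labelling of a 21-point hand skeleton: joint 0 is the wrist,
# and each finger contributes four joints from its base to the fingertip.
# The exact index order used by the invention is not specified in the text.
WRIST = 0
FINGERS = {
    "thumb":  [1, 2, 3, 4],
    "index":  [5, 6, 7, 8],
    "middle": [9, 10, 11, 12],
    "ring":   [13, 14, 15, 16],
    "little": [17, 18, 19, 20],
}

def skeleton_edges():
    """Connect the wrist to each finger base, then chain each finger's joints."""
    edges = []
    for joints in FINGERS.values():
        edges.append((WRIST, joints[0]))
        edges += list(zip(joints, joints[1:]))
    return edges

print(len(skeleton_edges()))  # 20 bones for 21 joints
```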
(3) preparing a training sample data set;
the basis of the identification of image or video content based on a convolutional neural network is a standardized data set; therefore, video acquisition is carried out on the basic gesture elements under the single view point so as to establish a basic gesture element database;
meanwhile, an existing large data set is usually divided into a training set, a verification set and a test set; the three subsets have no pairwise intersection, their union is the complete set, and, since they come from the same data set, they are independently and identically distributed. The verification set and the test set are used to test the accuracy of the model, and both are unrelated to the gradient-descent process during model training; however, because the verification results are used to adjust the number of iterations and the learning rate of the model, i.e. the model has a parameter-tuning process, the verification set is considered to participate in model training;
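A minimal sketch of such an 8:1:1 split is shown below; the shuffling seed and the file-name list are assumptions made only for illustration.

```python
# Sketch of the 8:1:1 train/verification/test split: the three subsets are
# disjoint and their union is the whole sample list. Seed and list contents
# are assumptions for illustration.
import random

def split_dataset(samples, seed=0):
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n = len(samples)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(f"clip_{i:03d}.mp4" for i in range(100))
print(len(train), len(val), len(test))  # 80 10 10
```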
step three, outputting a gesture Gaussian heat map and gesture joint point coordinates;
the real values of the gesture joint points are marked in heat-map form, and the model likewise outputs heat maps; the generated gesture joint point heat map takes a certain point of the pixel area where the joint point is located in the image as the center and a specific number of pixel points as the radius, and draws a circular area, dividing out the area where the joint point is located as the probability area where the joint point appears; the color at the center of this area is the deepest, indicating that the probability of the joint point at that position is the largest, and the color then becomes gradually lighter from the center outwards; this image form, peaking at the center with a gradually lighter periphery, is similar to the image of a Gaussian function, so a Gaussian function can be used to generate the heat map of each joint point region; the coordinates of the heat map are in the form (x, y), i.e. a two-dimensional Gaussian function is used:
f(x, y) = A · exp(-((x - x_0)^2 + (y - y_0)^2) / (2σ^2)) (11)
in the formula, x_0 and y_0 represent the real coordinate values of the gesture joint point; x and y represent the coordinate values of the pixel points in the heat-map area of the gesture joint point;
A represents the amplitude of the two-dimensional Gaussian function; σ represents the standard deviation of x and y;
for the size of the probability area of the gesture joint heat map, the probability area is defined as a circular area with a radius of 1; the amplitude A of the two-dimensional Gaussian function is given the value 1 and σ is given the value 1.5, and a distribution image of the two-dimensional Gaussian function is generated;
a heat map in the form of a two-dimensional Gaussian distribution is generated on the basis of the original picture; the heat map generates a Gaussian-distributed probability area based on the center coordinate of the gesture joint point area, where the probability value at the center of the area, i.e. the peak center point of the two-dimensional Gaussian function, is the largest and becomes smaller as it diffuses towards the periphery; in the Gaussian probability region centered on the peak point with the maximum probability value, the sum of the function values of all points is greater than 1, but within the probability region the probabilities of all pixel points at which the gesture joint point may appear should sum to 1; for this reason, the function values of all pixel points in the region are summed, and the function value corresponding to each pixel point is divided by this sum, ensuring that the probabilities of all points sum to 1; the processing is as follows:
P(x, y) = f(x, y) / Σf(x, y) (12)
in the formula, P(x, y) represents the processed probability that the joint point exists at the pixel point; f(x, y) represents the two-dimensional Gaussian function value corresponding to a pixel point in the probability region; Σf(x, y) represents the sum of the function values of all pixel points;
the heat maps generated based on the two-dimensional Gaussian functions are called Gaussian heat maps, and the Gaussian heat maps of all joint points are output at each stage of the model, namely each joint point corresponds to one Gaussian heat map;
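To make equations (11) and (12) concrete, the sketch below generates a normalized Gaussian heat map for a single joint point, reading the text as A = 1 and σ = 1.5; the 46 × 46 heat-map resolution is an assumption.

```python
# Sketch of equations (11) and (12): an A=1, sigma=1.5 two-dimensional
# Gaussian centred on the joint coordinate, normalised so that the
# probabilities inside the map sum to 1. The 46x46 resolution is an assumption.
import numpy as np

def gaussian_heatmap(x0, y0, height=46, width=46, amplitude=1.0, sigma=1.5):
    y, x = np.mgrid[0:height, 0:width]
    f = amplitude * np.exp(-((x - x0) ** 2 + (y - y0) ** 2) / (2 * sigma ** 2))  # eq. (11)
    return f / f.sum()                                                           # eq. (12)

heat = gaussian_heatmap(x0=20.0, y0=12.0)
print(heat.sum(), np.unravel_index(heat.argmax(), heat.shape))  # ~1.0, (12, 20)
```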
step four, constructing a gesture sequence recognition network;
the specific process of the network model construction is as follows:
(4.1) defining an activation function;
because the number of layers of the recurrent neural network involved is not large, the vanishing-gradient problem is relatively minor when the network is not deep, so Tanh is adopted as the activation function of the recurrent neural network;
the Tanh activation function is a hyperbolic tangent function, and the expression of Tanh and its derivatives is as follows:
Tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x}) (13)
Tanh′(x) = 1 - Tanh^2(x) (14)
(4.2) selecting a loss function;
the last layer of the network needs to output the classes of the basic gesture elements; a multi-class Softmax loss function is adopted to calculate the probability that the gesture in the input video belongs to each class, and the model finally outputs the class with the highest probability as the gesture prediction result for the video;
assuming that x is a set of feature vectors input into the Softmax layer by the recurrent neural network, and W and b are parameters of Softmax, the first step of Softmax is to score each category, calculate the score value Logit of each category:
Logit = W^T x + b (15)
next, the score for each category is converted to a respective probability value using Softmax:
Softmax(Logit_i) = e^{Logit_i} / Σ_j e^{Logit_j} (16)
where i represents the i-th gesture class and Logit_i is the score value of the i-th gesture;
The model outputs a probability distribution over the gesture categories; this distribution is the predicted value and is denoted q(x); each gesture also has an actual label, i.e. a real probability distribution, denoted p(x); the Softmax output is evaluated with the cross-entropy loss function, and since cross entropy describes the distance between two probability distributions, it can be defined as:
H(p, q) = -Σ p(x) log q(x) (22)
assuming that p(x) is (A, B, C) and q(x) is (u, v, w), where p(x) is the true value and q(x) is the predicted value, the cross entropy of p(x) represented by q(x) is:
H((A, B, C), (u, v, w)) = -(A log u + B log v + C log w) (23)
when the positions of q(x) and p(x) are interchanged, the cross entropy of the two is different; cross entropy measures information content through probability: the greater the probability of an event, the less information it contains, i.e. the smaller the entropy value; therefore, the closer the predicted probability distribution q(x) is to the real distribution p(x), the smaller their cross entropy, which means the output of the model is closer to the real value and the model's prediction is more accurate;
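The short sketch below ties equations (15), (16) and (22) together: scores are computed from a feature vector, converted to probabilities with Softmax, and compared with a one-hot label via cross entropy; the dimensions and values are illustrative assumptions.

```python
# Sketch of equations (15), (16) and (22): Logit = W^T x + b, the Softmax
# probabilities, and the cross entropy against a one-hot label.
# All dimensions and values are illustrative assumptions.
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())       # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=8)                      # feature vector from the RNN
W = rng.normal(size=(8, 5))                 # 5 basic gesture classes
b = np.zeros(5)

logits = W.T @ x + b                        # equation (15)
q = softmax(logits)                         # equation (16), predicted distribution q(x)
p = np.array([0, 0, 1, 0, 0])               # one-hot real distribution p(x)

cross_entropy = -np.sum(p * np.log(q))      # equation (22)
print(q.round(3), cross_entropy)
```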
(4.3) establishing a model;
in the model, X = (x_1, x_2, x_3, …, x_T) denotes the video frames expanded in time order; these time-series frames serve as the input of the recurrent neural network, the information contained in each frame is the joint coordinate values of the gesture, and the length of the time series is set to T; the hidden states of the first hidden layer are H^{(1)} = (h_1^{(1)}, h_2^{(1)}, …, h_T^{(1)}); then, for the hidden state of the first hidden layer:
h_t^{(1)} = Tanh(U x_t + W h_{t-1}^{(1)} + b) (24)
wherein, the hidden state of the first sequence of the first hidden layer is:
h_1^{(1)} = Tanh(U x_1 + b) (25)
for the second hidden layer, its input is determined by its own hidden state at the previous time step and by the hidden state of the first hidden layer at the current time step; the hidden state of the second hidden layer can be expressed as:
h_t^{(2)} = Tanh(U h_t^{(1)} + W h_{t-1}^{(2)} + b) (26)
wherein, the hidden state of the first sequence of the second hidden layer is:
h_1^{(2)} = Tanh(U h_1^{(1)} + b) (27)
the final output is the predicted classification result for each gesture, Y = (Y_1, Y_2, Y_3, Y_4, …, Y_n), where:
Y_i = Softmax(V h_T + c) (28)
where i = 1, 2, 3, 4, …, n; U, W and V are parameter matrices used to perform matrix transformations on the input and the hidden states of the hidden layers, b and c are biases, and all parameters are shared at the various stages of the network;
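A minimal NumPy sketch of recurrences (24)-(28) is given below; the use of separate parameter matrices for the two hidden layers, the layer sizes and the random initialisation are assumptions made only for illustration.

```python
# Sketch of the two-hidden-layer recurrence in equations (24)-(28):
# h_t^(1) = tanh(U x_t + W h_{t-1}^(1) + b), h_t^(2) = tanh(U h_t^(1) + W h_{t-1}^(2) + b),
# Y = Softmax(V h_T + c). Separate matrices per layer, the layer sizes and the
# random initialisation are assumptions for illustration only.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(X, params):
    U1, W1, b1, U2, W2, b2, V, c = params
    h1 = np.zeros(W1.shape[0])
    h2 = np.zeros(W2.shape[0])
    for x_t in X:                                   # parameters shared over all time steps
        h1 = np.tanh(U1 @ x_t + W1 @ h1 + b1)       # equations (24)/(25)
        h2 = np.tanh(U2 @ h1 + W2 @ h2 + b2)        # equations (26)/(27)
    return softmax(V @ h2 + c)                      # equation (28)

rng = np.random.default_rng(0)
D, H, C, T = 42, 64, 5, 30                          # assumed: 21 joints x 2 coords, hidden size, classes, frames
params = (rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H)), np.zeros(H),
          rng.normal(0, 0.1, (H, H)), rng.normal(0, 0.1, (H, H)), np.zeros(H),
          rng.normal(0, 0.1, (C, H)), np.zeros(C))
probs = rnn_forward(rng.normal(size=(T, D)), params)
print(probs.round(3), probs.argmax())
```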
and finally, inputting the joint point coordinates obtained in the third step into a standard gesture sequence recognition network to obtain a gesture action sequence.
CN202010011805.1A 2020-01-06 2020-01-06 Dynamic gesture action recognition method based on deep learning Active CN111209861B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010011805.1A CN111209861B (en) 2020-01-06 2020-01-06 Dynamic gesture action recognition method based on deep learning

Publications (2)

Publication Number Publication Date
CN111209861A true CN111209861A (en) 2020-05-29
CN111209861B CN111209861B (en) 2022-03-18

Family

ID=70789567

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010011805.1A Active CN111209861B (en) 2020-01-06 2020-01-06 Dynamic gesture action recognition method based on deep learning

Country Status (1)

Country Link
CN (1) CN111209861B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106991372A (en) * 2017-03-02 2017-07-28 北京工业大学 A kind of dynamic gesture identification method based on interacting depth learning model
WO2019006473A1 (en) * 2017-06-30 2019-01-03 The Johns Hopkins University Systems and method for action recognition using micro-doppler signatures and recurrent neural networks
CN110287844A (en) * 2019-06-19 2019-09-27 北京工业大学 Traffic police's gesture identification method based on convolution posture machine and long memory network in short-term
CN110458046A (en) * 2019-07-23 2019-11-15 南京邮电大学 A kind of human body motion track analysis method extracted based on artis

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YUEH WU等: "Applying hand gesture recognition and joint tracking to a TV controller using CNN and Convolutional Pose Machine", 《2018 24TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR)》 *
卢兴沄: "一种类人机器人手势识别算法及其实现", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950341A (en) * 2020-06-19 2020-11-17 南京邮电大学 Real-time gesture recognition method and gesture recognition system based on machine vision
CN113196289A (en) * 2020-07-02 2021-07-30 浙江大学 Human body action recognition method, human body action recognition system and device
CN112102451A (en) * 2020-07-28 2020-12-18 北京云舶在线科技有限公司 Common camera-based wearable virtual live broadcast method and equipment
CN112102451B (en) * 2020-07-28 2023-08-22 北京云舶在线科技有限公司 Wearable virtual live broadcast method and equipment based on common camera
CN111881994B (en) * 2020-08-03 2024-04-05 杭州睿琪软件有限公司 Identification processing method and apparatus, and non-transitory computer readable storage medium
CN111881994A (en) * 2020-08-03 2020-11-03 杭州睿琪软件有限公司 Recognition processing method and apparatus, and non-transitory computer-readable storage medium
CN112699837A (en) * 2021-01-13 2021-04-23 新大陆数字技术股份有限公司 Gesture recognition method and device based on deep learning
CN112862096A (en) * 2021-02-04 2021-05-28 百果园技术(新加坡)有限公司 Model training and data processing method, device, equipment and medium
CN113313161A (en) * 2021-05-24 2021-08-27 北京大学 Object shape classification method based on rotation invariant canonical invariant network model
CN113313161B (en) * 2021-05-24 2023-09-26 北京大学 Object shape classification method based on rotation-invariant standard isomorphism network model
CN113269089A (en) * 2021-05-25 2021-08-17 上海人工智能研究院有限公司 Real-time gesture recognition method and system based on deep learning
CN113269089B (en) * 2021-05-25 2023-07-18 上海人工智能研究院有限公司 Real-time gesture recognition method and system based on deep learning
TWI787841B (en) * 2021-05-27 2022-12-21 中強光電股份有限公司 Image recognition method
CN113743247A (en) * 2021-08-16 2021-12-03 电子科技大学 Gesture recognition method based on Reders model
US20230107097A1 (en) * 2021-10-06 2023-04-06 Fotonation Limited Method for identifying a gesture
US11983327B2 (en) * 2021-10-06 2024-05-14 Fotonation Limited Method for identifying a gesture
CN114185429B (en) * 2021-11-11 2024-03-26 杭州易现先进科技有限公司 Gesture key point positioning or gesture estimating method, electronic device and storage medium
CN114185429A (en) * 2021-11-11 2022-03-15 杭州易现先进科技有限公司 Method for positioning gesture key points or estimating gesture, electronic device and storage medium
CN114499712A (en) * 2021-12-22 2022-05-13 天翼云科技有限公司 Gesture recognition method, device and storage medium
CN114499712B (en) * 2021-12-22 2024-01-05 天翼云科技有限公司 Gesture recognition method, device and storage medium
CN115273244B (en) * 2022-09-29 2022-12-20 合肥工业大学 Human body action recognition method and system based on graph neural network
CN115273244A (en) * 2022-09-29 2022-11-01 合肥工业大学 Human body action recognition method and system based on graph neural network
CN116645727B (en) * 2023-05-31 2023-12-01 江苏中科优胜科技有限公司 Behavior capturing and identifying method based on Openphase model algorithm
CN116645727A (en) * 2023-05-31 2023-08-25 江苏中科优胜科技有限公司 Behavior capturing and identifying method based on Openphase model algorithm
CN116974369A (en) * 2023-06-21 2023-10-31 广东工业大学 Method, system, equipment and storage medium for operating medical image in operation
CN116974369B (en) * 2023-06-21 2024-05-17 广东工业大学 Method, system, equipment and storage medium for operating medical image in operation
CN116959120A (en) * 2023-09-15 2023-10-27 中南民族大学 Hand gesture estimation method and system based on hand joints
CN116959120B (en) * 2023-09-15 2023-12-01 中南民族大学 Hand gesture estimation method and system based on hand joints

Also Published As

Publication number Publication date
CN111209861B (en) 2022-03-18

Similar Documents

Publication Publication Date Title
CN111209861B (en) Dynamic gesture action recognition method based on deep learning
CN111191627B (en) Method for improving accuracy of dynamic gesture motion recognition under multiple viewpoints
CN105975931B (en) A kind of convolutional neural networks face identification method based on multiple dimensioned pond
Lim et al. Isolated sign language recognition using convolutional neural network hand modelling and hand energy image
Amor et al. Action recognition using rate-invariant analysis of skeletal shape trajectories
Chaudhary et al. Intelligent approaches to interact with machines using hand gesture recognition in natural way: a survey
CN111695457B (en) Human body posture estimation method based on weak supervision mechanism
CN110458046B (en) Human motion trajectory analysis method based on joint point extraction
CN113221663B (en) Real-time sign language intelligent identification method, device and system
CN113128424B (en) Method for identifying action of graph convolution neural network based on attention mechanism
CN112101262B (en) Multi-feature fusion sign language recognition method and network model
EP4099213A1 (en) A method for training a convolutional neural network to deliver an identifier of a person visible on an image, using a graph convolutional neural network
CN112800990B (en) Real-time human body action recognition and counting method
CN113191243B (en) Human hand three-dimensional attitude estimation model establishment method based on camera distance and application thereof
CN111709268A (en) Human hand posture estimation method and device based on human hand structure guidance in depth image
CN112906520A (en) Gesture coding-based action recognition method and device
CN110163130B (en) Feature pre-alignment random forest classification system and method for gesture recognition
Kowdiki et al. Adaptive hough transform with optimized deep learning followed by dynamic time warping for hand gesture recognition
Ikram et al. Real time hand gesture recognition using leap motion controller based on CNN-SVM architechture
CN111274901B (en) Gesture depth image continuous detection method based on depth gating recursion unit
Memmesheimer et al. Gesture recognition on human pose features of single images
Postnikov et al. Conditioned human trajectory prediction using iterative attention blocks
CN114898464A (en) Lightweight accurate finger language intelligent algorithm identification method based on machine vision
CN114202801A (en) Gesture recognition method based on attention-guided airspace map convolution simple cycle unit
CN114973305A (en) Accurate human body analysis method for crowded people

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant