CN111209861A - Dynamic gesture action recognition method based on deep learning - Google Patents


Info

Publication number
CN111209861A
Authority
CN
China
Prior art keywords
gesture
joint
probability
joint point
model
Prior art date
Legal status
Granted
Application number
CN202010011805.1A
Other languages
Chinese (zh)
Other versions
CN111209861B (en)
Inventor
张烨
陈威慧
樊一超
Current Assignee
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202010011805.1A priority Critical patent/CN111209861B/en
Publication of CN111209861A publication Critical patent/CN111209861A/en
Application granted granted Critical
Publication of CN111209861B publication Critical patent/CN111209861B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06V 40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/107 Static hand or arm

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

A dynamic gesture action recognition method based on deep learning comprises the following steps: step one, constructing a gesture joint point coordinate recognition network, in which an improved CPM (Convolutional Pose Machine) model is used to process a gesture video and output gesture joint point coordinates under a single viewpoint; step two, collecting single-viewpoint video data, i.e. gesture video samples are collected in a single-viewpoint mode, with an ordinary web camera capturing the user's gesture data from multiple angles, comprising: (2.1) defining basic gesture elements; (2.2) selecting gesture joint points; (2.3) preparing a training sample data set; step three, outputting gesture Gaussian heat maps and gesture joint point coordinates; step four, constructing a gesture sequence recognition network, the specific process of building the network model being: (4.1) defining an activation function; (4.2) selecting a loss function; (4.3) establishing the model. Finally, the joint point coordinates obtained in step three are input into the standard gesture sequence recognition network to obtain a gesture action sequence.

Description

Dynamic gesture action recognition method based on deep learning
Technical Field
The invention relates to a dynamic gesture action recognition method based on deep learning.
Background Art
Under the development wave of computer vision, recognizing human gesture actions with convolutional neural networks has become a new research direction. In gesture action recognition, methods based on convolutional neural networks have lower cost and time consumption and higher recognition efficiency than traditional methods; they save the steps of gesture segmentation, manual feature extraction and template matching, and reduce the complexity of the model. However, existing gesture recognition methods only recognize which class a static or dynamic gesture belongs to, and only recognize single gestures; they do not recognize continuous, temporally overlapping dynamic gestures, because there is no recognition framework for combined continuous actions, and such gesture recognition therefore cannot be applied in practical production.
Disclosure of Invention
The present invention provides a gesture recognition method based on computer vision to overcome the above disadvantages of the prior art.
The method first improves the CPM model to construct a gesture joint point coordinate recognition network model, then collects a gesture video under a single viewpoint, and then feeds the collected video into the standard gesture joint point coordinate recognition network to obtain gesture Gaussian heat maps and joint point coordinates. The joint point coordinates are then input into the standard gesture sequence recognition network to obtain a gesture action sequence, and recognition of continuous actions is finally realized.
In order to achieve the purpose, the invention adopts the following technical scheme:
a dynamic gesture action recognition method based on deep learning comprises the following steps:
step one, constructing a gesture joint point coordinate identification network;
the invention utilizes an improved CPM model to process a gesture video and output the coordinates of gesture joint points under a single viewpoint, and the realization process comprises the following steps:
(1) selecting a basic network model for gesture joint point estimation;
the invention selects VGG-13 as a basic network model for gesture joint point estimation.
(2) Setting a receptive field;
the size of the receptive field is related to the sliding window of convolution or pooling, and both are considered as a map, compressing the k × k range of pixel values on the n-layer feature map into one pixel on the n + 1-layer feature map, denoted as fk sWhere s represents the step size of the sliding window, k represents the size of the convolution kernel or pooling kernel, and the mapping relationship is:
Figure BDA0002356368640000011
wherein: x is the number ofn,xn+1Characteristic diagrams of the nth layer and the (n + 1) th layer are shown. The basic network structure of the invention is based on VGG-13, and for the first part of VGG-13, two convolutions and one pooling are included, and the three structures form a cascade, so that the mapping process is repeated in the network for many times to form a multi-level mapping. The parameters of the receptive field and convolution kernel or pooling kernel for each link are shown in table 1:
TABLE 1 Receptive field and convolution/pooling kernel parameters of each layer's feature map under the cascade

Layer            Kernel size K_n   Stride S_n   Receptive field RF_n
Original image   -                 -            1 × 1
Convolution 1    3 × 3             1            3 × 3
Convolution 2    3 × 3             1            5 × 5
Pooling          2 × 2             2            6 × 6
Denote by RF_n the receptive field of the n-th feature map, by K_n the size of the convolution or pooling kernel of the n-th layer, and by S_n the stride of K_n. The relationship between the receptive field, the stride and the kernel size can be derived from the receptive-field rule in Table 1.
The receptive field size of the feature map after the first layer of convolution is the size of the convolution kernel:
RF_1 = K_1    (2)

When the stride is 1, the receptive field of the n-th (n ≥ 2) feature map is:

RF_n = RF_{n-1} + (K_n - 1)    (3)

For the case where the stride is not 1 and n ≥ 2:

RF_n = RF_{n-1} + (K_n - 1) × S_n    (4)
(3) extracting features;
the invention utilizes a basic network model VGG-13 to extract the characteristics of the image.
First, define the position coordinate of the p-th joint point in the image pixels as Y_p; then,

Y_p ∈ Z ⊂ R²    (5)
where the set Z represents the position of all pixels in the image.
Suppose there are P joint points to be predicted; the goal is to obtain the coordinates Y of all P joint points:

Y = (Y_1, Y_2, …, Y_P)    (6)
from the above relationship, Y is a subset of Z.
A multi-stage prediction classifier g_t(x) is then defined to predict the position of every joint point at each stage. At each stage t ∈ {1, 2, …, T}, the prediction classifier assigns each point z in the image a confidence of being the location Y_p and generates a heat map for every gesture joint point; the specific expression is:
g_t(x_z) → {b_t^p(Y_p = z)},  p = 1, …, P    (7)
when the classifier predicts the gesture joint point location in the first stage, it generates a heat map and corresponding gesture joint point confidence scores:
g_1(x_z) → {b_1^p(Y_p = z)},  p = 1, …, P    (8)
where b_1^p(Y_p = z) is the confidence score, produced by the first-stage classifier, that the p-th gesture joint point is located at position z.
For each of the following stages, the confidence score for the p-th gesture joint at z-position may be expressed as:
b_t^p[u, v] = b_t^p(Y_p = z),  z = (u, v)    (9)
wherein u, v represent coordinate values of a position z in the image.
In each subsequent stage t (t ≥ 2), a more accurate position coordinate z is assigned to each gesture joint point on the basis of the previous stage's heat maps and gesture joint point confidence scores. This more accurate position z is determined from the image features extracted by the classifier in the first stage and the image context information extracted by the classifier in the previous stage. As before, the prediction classifier of each subsequent stage still generates the gesture joint point heat maps of that stage and the corresponding confidence scores:
g_t(x'_z, ψ_t(z, b_{t-1})) → {b_t^p(Y_p = z)},  p = 1, …, P    (10)
where ψ_t(z, b_{t-1}) denotes the mapping from the previous-stage confidence scores to image context information, and x'_z denotes the image features extracted around position z in the preceding stage.
Under the continuous repetition of the above processes, the position of the p-th gesture joint point is corrected in each stage based on the image context information in the previous stage and the image features extracted in the first stage, and the model finally estimates the more accurate coordinate position of the gesture joint point through the gradual fine adjustment process.
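For illustration, a minimal PyTorch-style sketch of this multi-stage refinement is given below: a first-stage head predicts one belief (heat) map per joint point from shared image features, and each later stage refines the previous belief maps together with the same features. The channel counts, kernel sizes and number of stages are illustrative assumptions rather than the exact configuration of the improved CPM model.

```python
import torch
import torch.nn as nn

P = 21          # number of gesture joint points (21 joints are selected in step two)
T_STAGES = 3    # number of refinement stages; the exact count is an assumption

class Stage1(nn.Module):
    """Predicts an initial belief map b_1^p for every joint from image features."""
    def __init__(self, feat_ch=128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(feat_ch, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, P, 1))                 # one belief-map channel per joint
    def forward(self, feats):
        return self.head(feats)

class StageT(nn.Module):
    """Refines the previous stage's belief maps using image features + context (t >= 2)."""
    def __init__(self, feat_ch=128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(feat_ch + P, 128, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(128, P, 1))
    def forward(self, feats, prev_belief):
        # psi_t(z, b_{t-1}) is realised here simply by concatenating the previous
        # belief maps with the shared image features before the refinement head.
        return self.head(torch.cat([feats, prev_belief], dim=1))

def multi_stage_beliefs(feats, stage1, later_stages):
    """Returns the list of belief maps produced at every stage."""
    beliefs = [stage1(feats)]
    for stage in later_stages:
        beliefs.append(stage(feats, beliefs[-1]))
    return beliefs

stage1, later = Stage1(), [StageT() for _ in range(T_STAGES - 1)]
feats = torch.randn(1, 128, 46, 46)               # stand-in for backbone features
maps = multi_stage_beliefs(feats, stage1, later)
print([m.shape for m in maps])                    # T_STAGES sets of 21 belief maps
```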
Step two, collecting single-viewpoint video data;
the invention collects gesture video samples in a single-viewpoint mode, namely a common network camera captures gesture data of a user from multiple angles, wherein:
(1) defining basic gesture elements;
the invention redefines the basic action elements recognized visually, and calls the determined specific recognizable basic action elements as basic gesture elements and defines the signs of the basic gesture elements.
(2) Selecting a gesture joint point;
the gesture joint points are identified, the identified joint points are connected and labeled in sequence to form a hand posture skeleton, the hand posture is identified by identifying the hand skeleton posture, and the process is defined as gesture estimation. When the fingers are bent, the fingers are usually divided into three small sections, so that the fingers present different bending degrees, and the connection points between the three sections are just the joint points of the fingers, therefore, the invention selects the point of the fingertip part of each finger as the initial joint point of the finger, then connects the joint points on the three small sections on each finger, then the tail joint point on each finger is connected with one joint point on the wrist, and after the joint points of the model are selected, the joint points of the model are labeled and connected according to a certain sequence to form a gesture framework.
(3) Preparing a training sample data set;
the basis for image or video content identification based on convolutional neural networks is a standardized data set. Therefore, the invention carries out video acquisition on the basic gesture elements under the single viewpoint so as to establish the basic gesture element database.
Meanwhile, an existing large data set is generally divided into a training set, a verification set and a test set. The three subsets have no intersection between every two subsets, the union of the three subsets is a complete set, and the three subsets are independently and identically distributed due to the fact that the three subsets are from the same data set. The verification set and the test set are used for testing the accuracy of the model, and both the verification set and the test set are irrelevant to the gradient descent process during model training, but due to the participation of the verification set, the verification result regulates the iteration number and the learning rate of the model, namely the model has a parameter adjustment process, so that the verification set is considered to participate in the model training.
Step three, outputting a gesture Gaussian heat map and gesture joint point coordinates;
the invention adopts a heat map form to label the real value of the gesture joint point, and simultaneously adopts the heat map as the output of the model, wherein the generated gesture joint point heat map takes a certain point of a pixel area where the joint point is located in an image as the center, takes the specific number of pixel points as the radius, draws a circular area, divides the area where the joint point is located out as a probability area where the joint point appears, the color is deepest in the center of the area, the probability of the joint point at the position is shown to be the maximum, and then the color of the area is gradually lightened from the center to the outside. This color will peak in the center and the image form that becomes lighter around resembles a gaussian image, so the gaussian can be used to generate a heat map for each joint area. The coordinates of the heatmap in the present invention are in the form of (x, y), i.e., a formula with a two-dimensional gaussian function:
f(x, y) = A·exp(−((x − x_0)² + (y − y_0)²) / (2σ²))    (11)

where x_0, y_0 denote the true coordinate values of the gesture joint point; x and y denote the coordinate values of a pixel in the joint point's heat-map region; A denotes the amplitude of the two-dimensional Gaussian function; and σ denotes the standard deviation in x and y.
For the size of the probability area of the gesture joint heat map, the invention defines the probability area as a circular area with the radius of 1, wherein the given value of the amplitude A of the two-dimensional Gaussian function is 1, and the given value of the sigma of the two-dimensional Gaussian function is 1.5, so that a distribution image of the two-dimensional Gaussian function is generated.
A heat map in the form of a two-dimensional Gaussian distribution is generated on top of the original picture. Based on the center coordinate of the gesture joint region, the heat map produces a Gaussian-distributed probability region: the probability value is largest at the center of the region, i.e. at the peak point of the two-dimensional Gaussian, and becomes smaller as it spreads outwards. In such a Gaussian probability region centered on the peak point, the sum of the values of all points is greater than 1, whereas the probabilities of all pixel positions at which the gesture joint point may appear should sum to 1. Therefore the function values of all pixels in the region are summed, and the function value of each pixel is divided by this sum, ensuring that the probabilities of all points sum to 1. The processing is as follows:
P(x, y) = f(x, y) / Σ f(x, y)    (12)

where P(x, y) is the normalized probability that the joint point is located at pixel (x, y); f(x, y) is the two-dimensional Gaussian function value of a pixel in the probability region; and Σ f(x, y) is the sum of the function values of all pixels in the region.
In the invention, the heat maps generated from the two-dimensional Gaussian function are called Gaussian heat maps; at every stage of the model, the Gaussian heat maps of all joint points are output, i.e. one Gaussian heat map per joint point.
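For illustration, a minimal NumPy sketch of the Gaussian heat-map construction of equations (11)-(12) follows, using the stated values A = 1, σ = 1.5 and a radius-1 probability region; the heat-map size and the example joint position are assumptions chosen only for the demonstration, and the second helper shows how a joint coordinate can be read back from a heat map by taking its peak.

```python
import numpy as np

def gaussian_heatmap(h, w, x0, y0, A=1.0, sigma=1.5, radius=1):
    """Heat map of equation (11), normalised as in equation (12) so that the
    probabilities inside the circular region around (x0, y0) sum to 1."""
    ys, xs = np.mgrid[0:h, 0:w]
    f = A * np.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2.0 * sigma ** 2))
    mask = (xs - x0) ** 2 + (ys - y0) ** 2 <= radius ** 2   # circular probability region
    f = f * mask
    return f / f.sum()                      # P(x, y) = f(x, y) / sum f(x, y)

def heatmap_to_coordinate(heatmap):
    """Joint coordinate = pixel with the highest probability (the Gaussian peak)."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return x, y

hm = gaussian_heatmap(46, 46, x0=20, y0=12)
print(heatmap_to_coordinate(hm), hm.sum())  # -> (20, 12) and a probability sum of 1.0
```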
Step four, constructing a gesture sequence recognition network;
the specific process of the network model construction is as follows:
(1) defining an activation function;
the number of layers of the recurrent neural network related to the invention is not large, and the problem of gradient disappearance is relatively small under the condition of not deep network layers, so Tanh is adopted as an activation function in the recurrent neural network.
The Tanh activation function is a hyperbolic tangent function, and the expression of Tanh and its derivatives is as follows:
tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x})    (13)

tanh′(x) = 1 − tanh²(x)    (14)
(2) selecting a loss function;
the method comprises the steps of outputting the category of basic gesture elements in the last layer of the network, calculating the probability that gestures in an input video respectively belong to each category by adopting a multi-category Softmax loss function, and outputting the category with the highest probability in each category as a gesture prediction result in the video by a model.
Assuming that x is a set of feature vectors input into the Softmax layer by the recurrent neural network, and W and b are parameters of Softmax, the first step of Softmax is to score each category, calculate the score value Logit of each category:
Logit = W^T·x + b    (15)
next, the score for each category is converted to a respective probability value using Softmax:
P_i = e^{Logit_i} / Σ_j e^{Logit_j}    (16)

where i denotes the i-th gesture category and Logit_i is the score of the i-th gesture.
The model outputs a probability distribution over the gesture categories; this predicted distribution is denoted q(x), and each gesture also has an actual label, i.e. a true probability distribution, denoted p(x). The loss function used with Softmax is the cross-entropy loss; cross entropy describes the distance between two probability distributions and can be defined as:
H(p, q) = −Σ p(x)·log q(x)    (22)

Assume that p(x) = (A, B, C) is the true distribution and q(x) = (u, v, w) is the predicted one; the cross entropy of p(x) as represented by q(x) is then:
H((A, B, C), (u, v, w)) = −(A·log u + B·log v + C·log w)    (23)

When q(x) and p(x) are interchanged, the cross entropy of the two is different. Cross entropy is measured in terms of probability: the more probable an event is, the less information it contains, i.e. the smaller its entropy. Therefore, the closer the predicted probability distribution q(x) is to the true distribution p(x), the smaller their cross entropy, which means that the output of the model is closer to the true value and the prediction of the model is more accurate.
(3) Establishing a model;

In the model, X = (x_1, x_2, x_3, …, x_T) is the sequence of video frames expanded in time order. These time-ordered frames are the input of the recurrent neural network, the information contained in each frame is the joint point coordinate values of the gesture, and the length of the sequence is T. The hidden states of the first hidden layer are H^{(1)} = (h_1^{(1)}, h_2^{(1)}, …, h_T^{(1)}); for the hidden state of the first hidden layer there is:

h_t^{(1)} = tanh(U·x_t + W·h_{t−1}^{(1)} + b)    (24)
wherein, the hidden state of the first sequence of the first hidden layer is:
h_1^{(1)} = tanh(U·x_1 + b)    (25)

For the second hidden layer, its input is determined by its own hidden state at the previous time step together with the hidden state of the first hidden layer at the current time step, so its hidden state can be expressed as:

h_t^{(2)} = tanh(U·h_t^{(1)} + W·h_{t−1}^{(2)} + b)    (26)
wherein, the hidden state of the first sequence of the second hidden layer is:
h_1^{(2)} = tanh(U·h_1^{(1)} + b)    (27)

The final output is the predicted classification result for each gesture, Y = (Y_1, Y_2, Y_3, Y_4, …, Y_n), with:

Y_i = Softmax(V·h_T + c)    (28)

where i = 1, 2, 3, 4, …, n; U, W and V are the parameter matrices used for the matrix transformations of the inputs and of the hidden-layer states; b and c are bias terms; and all parameters are shared across the stages of the network.
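For illustration, a minimal NumPy sketch of the forward pass defined by equations (24)-(28) follows. It assumes zero initial hidden states and, because U, W and b are shared by both hidden layers, an input dimension equal to the hidden dimension so that the shapes line up; the random parameter values are placeholders only.

```python
import numpy as np

def rnn_forward(X, U, W, V, b, c):
    """Forward pass of the two-hidden-layer recurrent network of equations (24)-(28).
    X has shape (T, d): one joint-coordinate feature vector per video frame."""
    h1 = np.zeros(U.shape[0])                    # assumed zero initial hidden states
    h2 = np.zeros(U.shape[0])
    for x_t in X:
        h1 = np.tanh(U @ x_t + W @ h1 + b)       # first hidden layer, eq. (24)
        h2 = np.tanh(U @ h1 + W @ h2 + b)        # second hidden layer, eq. (26)
    logits = V @ h2 + c                          # eq. (28) before the Softmax
    e = np.exp(logits - logits.max())
    return e / e.sum()                           # predicted gesture probabilities Y

rng = np.random.default_rng(0)
d, n_classes, T = 42, 5, 30                      # 21 joints x 2 coordinates, 5 gestures
U, W = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
V, b, c = rng.normal(size=(n_classes, d)) * 0.1, np.zeros(d), np.zeros(n_classes)
print(rnn_forward(rng.normal(size=(T, d)), U, W, V, b, c))
```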
And finally, inputting the joint point coordinates obtained in the step three into a standard gesture sequence recognition network to obtain a gesture action sequence.
The invention has the advantages that:
the invention provides a gesture recognition algorithm fusing a recurrent neural network based on a computer vision technology, and the gesture recognition algorithm is used for recognizing the gesture actions of staff in the production process. The outstanding characteristics are that: aiming at the problem that continuous complex actions are difficult to identify through a computer vision technology in actual production, a CPM model is improved, a gesture joint coordinate identification network model is established to obtain gesture joint coordinates of gesture video samples collected under a single viewpoint, the gesture joint coordinates are input into a corrected standard gesture sequence identification network, a gesture action sequence is obtained, and continuous actions are identified.
Drawings
FIG. 1 is a model structure of a VGG-13 of the present invention;
FIG. 2 is a schematic diagram of 21 selected gesture joints according to the present invention;
FIG. 3 is a schematic diagram of the gesture joint point labels and skeleton of the present invention;
4 a-4 e are screenshots of video samples of 5 basic gesture elements of the present invention; where fig. 4a is a hands-free movement, fig. 4b is a release or placement, fig. 4c is a rotation, fig. 4d is a load movement, fig. 4e is a grasping;
FIG. 5 is a two-dimensional Gaussian function distribution plot of the present invention;
FIG. 6 is a graph of Tanh activation function and its derivative function according to the present invention;
FIG. 7 is a schematic diagram of a recurrent neural network architecture of the present invention;
FIG. 8 is a schematic diagram of a recurrent neural network structure for five gesture classes in accordance with the present invention;
FIG. 9 is a gradient descent process of the minimization of loss function of the present invention;
FIG. 10 is a graph of the accuracy rate of the single viewpoint model for five basic gesture element recognition;
FIG. 11 is a flowchart of the deep learning based dynamic gesture recognition method of the present invention.
Detailed Description
The technical scheme of the invention is further explained by combining the attached drawings.
Based on the above problems, the invention provides a gesture action recognition method based on computer vision: the CPM model is first improved to construct a gesture joint point coordinate recognition network model, a gesture video is then collected under a single viewpoint, and the collected video is fed into the standard gesture joint point coordinate recognition network to obtain gesture Gaussian heat maps and joint point coordinates. The joint point coordinates are then input into the standard gesture sequence recognition network to obtain a gesture action sequence, and recognition of continuous actions is finally realized.
In order to verify the feasibility and the superiority of the method provided by the invention, five basic gestures are selected for verification and test, and the method comprises the following steps:
step one, constructing a gesture joint point coordinate identification network;
the invention utilizes an improved CPM model to process a gesture video and output the coordinates of gesture joint points under a single viewpoint, and the realization process comprises the following steps:
(1) selecting a basic network model for gesture joint point estimation;
the method selects VGG-13 as a basic network model for gesture joint point estimation, wherein the VGG-13 is composed of 5 groups of convolution groups, 5 pooling groups, 3 full connections and 1 softmax classification layer.
(2) Setting a receptive field;
the size of the receptive field is related to the sliding window of convolution or pooling, and both are considered as a map, compressing the k × k range of pixel values on the n-layer feature map into one pixel on the n + 1-layer feature map, denoted as fksWhere s represents the step size of the sliding window, k represents the size of the convolution kernel or pooling kernel, and the mapping relationship is:
Figure BDA0002356368640000071
wherein: x is the number ofn,xn+1Characteristic diagram of the n-th layer and the n + 1-th layer。
The basic network structure of the invention is based on VGG-13. The first part of VGG-13 contains two convolutions and one pooling, and these three structures form a cascade, so the mapping process is repeated many times in the network to form a multi-level mapping. Consider a 6 × 6 region of an original image. In this first part there are two convolution layers, each with a 3 × 3 kernel and stride 1, and one pooling layer with a 2 × 2 pooling kernel and stride 2. For the feature map output by the first convolution layer, since the kernel size is 3 × 3, the receptive field of a pixel of this feature map on the original image is 3 × 3. For the feature map output by the second convolution layer, the kernel size is still 3 × 3, so the receptive field of a pixel of the second feature map on the first feature map is also 3 × 3; pushing this 3 × 3 region of the first feature map back to the original image, and using the relationship between the first layer's receptive field and the original image, the 3 × 3 region of the first feature map corresponds to a 5 × 5 region of the original image, i.e. the receptive field of the second convolution layer's output on the original image is 5 × 5. For the feature map of the final pooling layer, a single output pixel corresponds to a 2 × 2 receptive field on the second feature map, which corresponds to a 4 × 4 region on the first feature map and, pushed back once more, to a 6 × 6 region on the original image; that is, the receptive field of the feature map output by the final pooling layer with respect to the original image is 6 × 6. The receptive field and convolution or pooling kernel parameters of each link are shown in Table 1, where the receptive field of the original image itself is 1 × 1:
TABLE 1 Receptive field and convolution/pooling kernel parameters of each layer's feature map under the cascade

Layer            Kernel size K_n   Stride S_n   Receptive field RF_n
Original image   -                 -            1 × 1
Convolution 1    3 × 3             1            3 × 3
Convolution 2    3 × 3             1            5 × 5
Pooling          2 × 2             2            6 × 6
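For illustration, a short Python sketch follows that reproduces the receptive-field column of Table 1 from the kernel sizes and strides, using the recurrence of formulas (2)-(4) below; the multiplier applied to (K_n − 1) is taken here as the cumulative stride of the preceding layers, which is the reading that yields the 3 × 3, 5 × 5 and 6 × 6 values above.

```python
def receptive_fields(layers):
    """Receptive field of each feature map with respect to the original image.
    `layers` is a list of (kernel_size, stride) pairs; the original image has RF = 1."""
    rf, jump, out = 1, 1, []
    for k, s in layers:
        rf = rf + (k - 1) * jump   # RF_n = RF_{n-1} + (K_n - 1) * (cumulative stride)
        jump *= s                  # cumulative stride of the layers processed so far
        out.append(rf)
    return out

# the cascade of the first part of VGG-13: conv 3x3/1, conv 3x3/1, pool 2x2/2
print(receptive_fields([(3, 1), (3, 1), (2, 2)]))   # -> [3, 5, 6]
```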
Denote by RF_n the receptive field of the n-th feature map, by K_n the size of the convolution or pooling kernel of the n-th layer, and by S_n the stride of K_n. The relationship between the receptive field, the stride and the kernel size can be derived from the receptive-field rule in Table 1.
The receptive field size of the feature map after the first layer of convolution is the size of the convolution kernel:
RF_1 = K_1    (2)

When the stride is 1, the receptive field of the n-th (n ≥ 2) feature map is:

RF_n = RF_{n-1} + (K_n - 1)    (3)

For the case where the stride is not 1 and n ≥ 2:

RF_n = RF_{n-1} + (K_n - 1) × S_n    (4)
if the design of the cascade structure is changed into a single convolution layer, the equivalent receptive field can also be achieved, the size of the convolution kernel at this time is 6 × 6, the step length is 1, and according to the formula (2), the receptive field of the output feature map after the convolution of the first layer is equal to the size of the convolution kernel, namely 6 × 6. The VGG-13 is selected as the basic network structure in the invention, because the utilization of the receptive field structure by the VGG-13, namely, two convolutions and a pooled cascade structure are used to replace a convolution of 6 x 6, the following advantages are achieved: 1) reducing the network parameters; 2) the nonlinear structure of the network is reinforced.
(3) Extracting features;
the invention utilizes a basic network model VGG-13 to extract the characteristics of the image.
First, define the position coordinate of the p-th joint point in the image pixels as Y_p; then,

Y_p ∈ Z ⊂ R²    (5)
where the set Z represents the position of all pixels in the image.
Suppose there are P joint points to be predicted; the goal is to obtain the coordinates Y of all P joint points:

Y = (Y_1, Y_2, …, Y_P)    (6)
from the above relationship, Y is a subset of Z.
A multi-stage prediction classifier g_t(x) is then defined to predict the position of every joint point at each stage. At each stage t ∈ {1, 2, …, T}, the prediction classifier assigns each point z in the image a confidence of being the location Y_p and generates a heat map for every gesture joint point; the specific expression is:
g_t(x_z) → {b_t^p(Y_p = z)},  p = 1, …, P    (7)
when the classifier predicts the gesture joint point location in the first stage, it generates a heat map and corresponding gesture joint point confidence scores:
g_1(x_z) → {b_1^p(Y_p = z)},  p = 1, …, P    (8)
where b_1^p(Y_p = z) is the confidence score, produced by the first-stage classifier, that the p-th gesture joint point is located at position z.
For each of the following stages, the confidence score for the p-th gesture joint at z-position may be expressed as:
b_t^p[u, v] = b_t^p(Y_p = z),  z = (u, v)    (9)
wherein u, v represent coordinate values of a position z in the image.
In each subsequent stage t (t ≥ 2), a more accurate position coordinate z is assigned to each gesture joint point on the basis of the previous stage's heat maps and gesture joint point confidence scores. This more accurate position z is determined from the image features extracted by the classifier in the first stage and the image context information extracted by the classifier in the previous stage. As before, the prediction classifier of each subsequent stage still generates the gesture joint point heat maps of that stage and the corresponding confidence scores:
g_t(x'_z, ψ_t(z, b_{t-1})) → {b_t^p(Y_p = z)},  p = 1, …, P    (10)
where ψ_t(z, b_{t-1}) denotes the mapping from the previous-stage confidence scores to image context information, and x'_z denotes the image features extracted around position z in the preceding stage.
Under the continuous repetition of the above processes, the position of the p-th gesture joint point is corrected in each stage based on the image context information in the previous stage and the image features extracted in the first stage, and the model finally estimates the more accurate coordinate position of the gesture joint point through the gradual fine adjustment process.
Step two, collecting single-viewpoint video data;
the invention collects gesture video samples in a single-viewpoint mode, namely a common network camera captures gesture data of a user from multiple angles, wherein:
(1) defining basic gesture elements;
the invention adjusts basic action elements of visual recognition on the basis of a model method, eighteen kinds of kinematical elements and the like, redefines action recognition elements, determines 5 specifically recognizable basic action elements, is called as basic gesture elements, namely, Empty hand movement, load movement, rotation, grabbing, releasing or placing, defines symbols thereof, and respectively represents Empty Move, Turn, Grasp and Release, and is specifically shown in table 2:
TABLE 2 basic gesture element Table
(2) Selecting a gesture joint point;
the invention realizes the recognition of the posture of the hand by recognizing gesture joint points and connecting the recognized joint points in sequence to form a skeleton of the posture of the hand, and defines the process as gesture estimation.
When a finger bends, it can be seen to consist of three small segments, which is what gives the finger its different degrees of bending, and the connection points between these segments are exactly the joint points of the finger. The invention therefore selects the fingertip of each finger as its initial joint point, then connects the joint points of the three segments of each finger, and finally connects the last joint point of each finger to a joint point at the wrist to form the skeleton of the hand posture; in total, 21 gesture joint points are selected.
After the joint points of the model are selected, they are labeled and connected in a fixed order to form the gesture skeleton. The joint point at the wrist is the final connection point of every finger, so it is labeled number 1 and serves as the starting point of the gesture joint points. Then, according to the spatial distance of the joint points, the four joint points of the thumb are labeled 2, 3, 4 and 5 from bottom to top, the fingertip being the end of each finger, and each remaining finger is labeled in the same bottom-to-top order.
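For illustration, the sketch below encodes one possible numbering of the 21 joint points and the skeleton connections; only the wrist (number 1) and the thumb (numbers 2-5) are fixed by the description above, so the numbering of the remaining fingers is an assumption that simply follows the same bottom-to-top rule.

```python
# Joint numbering assumed from the description: 1 = wrist, then each finger is
# numbered bottom-to-top, with the fingertip as the last, outermost joint.
JOINTS = {1: "wrist"}
FINGERS = ["thumb", "index", "middle", "ring", "little"]
for f_idx, name in enumerate(FINGERS):
    base = 2 + 4 * f_idx                      # thumb: 2-5, index: 6-9, ...
    for j in range(4):
        JOINTS[base + j] = f"{name}_{j + 1}"  # joint 4 of each finger is the fingertip

# Skeleton edges: wrist to the first joint of every finger, then along the finger.
BONES = []
for f_idx in range(5):
    base = 2 + 4 * f_idx
    BONES.append((1, base))
    BONES += [(base + j, base + j + 1) for j in range(3)]

assert len(JOINTS) == 21 and len(BONES) == 20
```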
(3) Preparing a training sample data set;
the basis for image or video content identification based on convolutional neural networks is a standardized data set. Since the invention needs to recognize specific 5 basic gesture elements, a sample data set of short video gesture elements with 5 basic gesture elements as the standard is established.
Video acquisition is carried out on 5 basic gesture elements under a single view point, 500 short videos of 1-2 seconds are acquired by each gesture and are completed by 20 different people, each person shoots 50 short videos by each gesture, and 5000 gesture short videos are obtained in total so as to establish a basic gesture element database.
For an existing large data set, if training of a supervised learning model is to be completed and accuracy of the supervised learning model is to be tested, the large data set is usually divided into a training set, a verification set and a test set according to a certain proportion, such as 8:1: 1. The three subsets have no intersection between every two subsets, the union of the three subsets is a complete set, and the three subsets are independently and identically distributed due to the fact that the three subsets are from the same data set. Although the verification set and the test set are both used for testing the accuracy of the model and are not related to the gradient descent process during model training, due to the participation of the verification set, the verification result regulates the iteration number and the learning rate of the model, namely the model has a parameter adjustment process, so that the verification set is considered to participate in the training of the model.
Based on the division rule of the data set, under the condition that the samples are independently and uniformly distributed, 5000 video samples are divided into a training set, a verification set and a test set according to the ratio of 8:1:1 in a uniform random sampling mode. The sample compositions of the divided training set, validation set and test set are shown in table 3, table 4 and table 5 below.
TABLE 3 basic gesture element training set sample composition
TABLE 4 basic gesture element verification set sample composition
TABLE 5 basic gesture element test set sample composition
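For illustration, a short Python sketch of the uniform random 8:1:1 split into the training, validation and test sets listed above follows; the clip file names and the per-gesture count are assumptions chosen only so that the total equals the 5000 collected samples.

```python
import random

def split_8_1_1(samples, seed=0):
    """Uniform random 8:1:1 split into training, validation and test sets."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n = len(samples)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])

# illustrative file names; the real database contains the collected short videos
clips = [f"gesture_{g}_{i:04d}.mp4" for g in range(5) for i in range(1000)]
train, val, test = split_8_1_1(clips)
print(len(train), len(val), len(test))   # -> 4000 500 500
```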
Step three, outputting a gesture Gaussian heat map and gesture joint point coordinates;
the invention adopts a heat map form to label the real value of the gesture joint point, and simultaneously adopts the heat map as the output of the model, wherein the generated gesture joint point heat map takes a certain point of a pixel area where the joint point is located in an image as the center, takes the specific number of pixel points as the radius, draws a circular area, divides the area where the joint point is located out as a probability area where the joint point appears, the color is deepest in the center of the area, the probability of the joint point at the position is shown to be the maximum, and then the color of the area is gradually lightened from the center to the outside. This color will peak in the center and the image form that becomes lighter around resembles a gaussian image, so the gaussian can be used to generate a heat map for each joint area. The coordinates of the heatmap in the present invention are in the form of (x, y), i.e., a formula with a two-dimensional gaussian function:
f(x, y) = A·exp(−((x − x_0)² + (y − y_0)²) / (2σ²))    (11)

where x_0, y_0 denote the true coordinate values of the gesture joint point; x and y denote the coordinate values of a pixel in the joint point's heat-map region; A denotes the amplitude of the two-dimensional Gaussian function; and σ denotes the standard deviation in x and y.
For the size of the probability area of the gesture joint heat map, the invention defines the probability area as a circular area with the radius of 1, wherein the given value of the amplitude A of the two-dimensional Gaussian function is 1, and the given value of the sigma of the two-dimensional Gaussian function is 1.5, so that a distribution image of the two-dimensional Gaussian function is generated.
A heat map in the form of a two-dimensional Gaussian distribution is generated on top of the original picture. Based on the center coordinate of the gesture joint region, the heat map produces a Gaussian-distributed probability region: the probability value is largest at the center of the region, i.e. at the peak point of the two-dimensional Gaussian, and becomes smaller as it spreads outwards. In such a Gaussian probability region centered on the peak point, the sum of the values of all points is greater than 1, whereas the probabilities of all pixel positions at which the gesture joint point may appear should sum to 1. Therefore the function values of all pixels in the region are summed, and the function value of each pixel is divided by this sum, ensuring that the probabilities of all points sum to 1. The processing is as follows:
P(x, y) = f(x, y) / Σ f(x, y)    (12)

where P(x, y) is the normalized probability that the joint point is located at pixel (x, y); f(x, y) is the two-dimensional Gaussian function value of a pixel in the probability region; and Σ f(x, y) is the sum of the function values of all pixels in the region.
In the invention, the heat maps generated from the two-dimensional Gaussian function are called Gaussian heat maps; at every stage of the model, the Gaussian heat maps of all joint points are output, i.e. one Gaussian heat map per joint point.
Step four, constructing a gesture sequence recognition network;
the specific process of the network model construction is as follows:
(1) defining an activation function;
the number of layers of the recurrent neural network related to the invention is not large, and the problem of gradient disappearance is relatively small under the condition of not deep network layers, so Tanh is adopted as an activation function in the recurrent neural network.
The Tanh activation function is a hyperbolic tangent function, and the expression of Tanh and its derivatives is as follows:
tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x})    (13)

tanh′(x) = 1 − tanh²(x)    (14)
(2) selecting a loss function;
the method comprises the steps of outputting the category of basic gesture elements in the last layer of the network, calculating the probability that gestures in an input video respectively belong to each category by adopting a multi-category Softmax loss function, and outputting the category with the highest probability in each category as a gesture prediction result in the video by a model.
Assuming that x is a set of feature vectors input into the Softmax layer by the recurrent neural network, and W and b are parameters of Softmax, the first step of Softmax is to score each category, calculate the score value Logit of each category:
Logit = W^T·x + b    (15)
next, Softmax converts the score of each category into a respective probability value, and assuming that the scores of the five gesture categories are (c, d, e, f, g), the formula for Softmax converting the scores into the probability values can be expressed as:
P_i = e^{Logit_i} / Σ_j e^{Logit_j}    (16)

where i denotes the i-th gesture category and Logit_i is the score of the i-th gesture. The probabilities of the five gesture categories can then be expressed as:
P_1 = e^c / (e^c + e^d + e^e + e^f + e^g)    (17)

P_2 = e^d / (e^c + e^d + e^e + e^f + e^g)    (18)

P_3 = e^e / (e^c + e^d + e^e + e^f + e^g)    (19)

P_4 = e^f / (e^c + e^d + e^e + e^f + e^g)    (20)

P_5 = e^g / (e^c + e^d + e^e + e^f + e^g)    (21)
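Before turning to the loss function, a short numerical sketch of equations (16)-(21), together with the cross-entropy of equation (22) defined below, is given here; the five score values and the one-hot label are assumptions chosen only for the demonstration.

```python
import numpy as np

def softmax(scores):
    """Converts the five gesture scores (c, d, e, f, g) into probabilities P_1..P_5,
    as in equations (16)-(21); subtracting the maximum keeps exp() numerically stable."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def cross_entropy(p_true, q_pred, eps=1e-12):
    """H(p, q) = -sum p(x) log q(x), the cross-entropy loss of equation (22)."""
    return -np.sum(p_true * np.log(q_pred + eps))

scores = np.array([2.1, 0.3, -1.0, 0.8, 0.0])   # hypothetical scores for the 5 gestures
q = softmax(scores)                              # predicted distribution q(x)
p = np.array([1.0, 0.0, 0.0, 0.0, 0.0])          # one-hot true distribution p(x)
print(q.round(3), cross_entropy(p, q))
```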
the model thus far outputs a probability distribution of five gesture classes, which is a predicted value and is referred to as q (x), and the gesture also carries an actual label, i.e. a true probability distribution, which is referred to as p (x). Since the Softmax function is also called cross-entropy loss function, while cross-entropy describes the distance problem between two probability distributions, it can be defined as:
H(p, q) = −Σ p(x)·log q(x)    (22)

Assume that p(x) = (A, B, C) is the true distribution and q(x) = (u, v, w) is the predicted one; the cross entropy of p(x) as represented by q(x) is then:
H((A, B, C), (u, v, w)) = −(A·log u + B·log v + C·log w)    (23)

When q(x) and p(x) are interchanged, the cross entropy of the two is different. Cross entropy is measured in terms of probability: the more probable an event is, the less information it contains, i.e. the smaller its entropy. Therefore, the closer the predicted probability distribution q(x) is to the true distribution p(x), the smaller their cross entropy, which means that the output of the model is closer to the true value and the prediction of the model is more accurate.
(3) Establishing a model;

In the model, X = (x_1, x_2, x_3, …, x_T) is the sequence of video frames expanded in time order. These time-ordered frames are the input of the recurrent neural network, the information contained in each frame is the joint point coordinate values of the gesture, and the length of the sequence is T. The hidden states of the first hidden layer are H^{(1)} = (h_1^{(1)}, h_2^{(1)}, …, h_T^{(1)}); for the hidden state of the first hidden layer there is:

h_t^{(1)} = tanh(U·x_t + W·h_{t−1}^{(1)} + b)    (24)
wherein, the hidden state of the first sequence of the first hidden layer is:
h_1^{(1)} = tanh(U·x_1 + b)    (25)

For the second hidden layer, its input is determined by its own hidden state at the previous time step together with the hidden state of the first hidden layer at the current time step, so its hidden state can be expressed as:

h_t^{(2)} = tanh(U·h_t^{(1)} + W·h_{t−1}^{(2)} + b)    (26)
wherein, the hidden state of the first sequence of the second hidden layer is:
h_1^{(2)} = tanh(U·h_1^{(1)} + b)    (27)

The final output is the predicted classification result of the five gestures, Y = (Y_1, Y_2, Y_3, Y_4, Y_5), with:

Y_i = Softmax(V·h_T + c)    (28)

where i = 1, 2, 3, 4, 5; U, W and V are the parameter matrices used for the matrix transformations of the inputs and of the hidden-layer states; b and c are bias terms; and all parameters are shared across the stages of the network.
And finally, inputting the joint point coordinates obtained in the step three into a standard gesture sequence recognition network to obtain a gesture action sequence.
(4) Updating a model based on a gradient descent method;
the neural network reversely propagates the loss function of the output layer back to the network by utilizing a gradient descent algorithm, and the contribution rate of the parameters to the loss is obtained, so that the parameters in the network are updated layer by layer. The gradient is the derivative in the differentiation, and the parameters of the loss function in the actual model are multivariate, so the partial derivative needs to be calculated on the parameters of the multivariate function, and the gradient is defined as:
∇J(θ_1, θ_2, …, θ_n) = (∂J/∂θ_1, ∂J/∂θ_2, …, ∂J/∂θ_n)    (29)

For this minimization problem, the principle of the gradient descent method is as follows: for the loss function J(θ_1, θ_2, …, θ_n), each parameter is updated by a step size, also called the learning rate, along the direction in which its gradient decreases the fastest, so that the value of the loss function decreases the fastest. The parameter update can be represented as the following process:
1) one direction of gradient descent is selected as the direction of the minimization loss function. The selected fastest gradient descent direction is the gradient maximum direction of a certain parameter:
Δ_θ J(θ_i) = ∂J(θ_i)/∂θ_i    (30)
2) the step size of the gradient descent, i.e., the learning rate η, is selected.
3) And (3) adjusting and updating the weight:
θi←θi-η·ΔθJ(θi) (31)
the gradient is propagated forwards layer by layer according to the processes to form a chain type derivation process, each layer of parameters are updated according to the three steps each time until the model training is finished, and the optimal solution is found.
(5) Training a model;
the invention inputs a video sequence, wherein the video sequence is a frame sequence arranged according to a time sequence, so that the input in each state is a video frame input at each moment. For a frame sequence of time length T, there is a loss function L at each instant(t)Then the sum of the losses at all times constitutes the total loss function:
L = Σ_{t=1}^{T} L^{(t)}    (32)
the input video is then predictively classified to be as consistent as possible with the given true label, and therefore a process is performed to bring the predicted value as close as possible to the true value, i.e. to minimize the loss function. Parameters in the network are updated in order to minimize the loss function. The output of each time sequence is o(t)Loss of L(t)Is formed by(t)Converted by the Softmax function, therefore, during the gradient back propagation of the loss function, the output o needs to be processed(t)The gradient is calculated by parameters V and c contained in the formula (1), and the parameters are respectively:
∂L/∂V = Σ_{t=1}^{T} (ŷ^{(t)} − y^{(t)})·(h^{(t)})^T    (33)

∂L/∂c = Σ_{t=1}^{T} (ŷ^{(t)} − y^{(t)})    (34)

For the loss of a single time step, only the gradients of V and c at that step need to be computed. The gradients of the parameters W, U and b, however, all depend on the gradient of the hidden layer, and from the structure of the recurrent neural network the gradient of the hidden layer at time t is related not only to the loss of the current time step but also to the loss at time t + 1. The gradient of the hidden layer at time t is therefore defined first and denoted δ^{(t)}:
δ^{(t)} = ∂L/∂h^{(t)}    (35)
Then, because the gradient of the hidden layer at time t is determined by two time loss functions, the true gradient is the sum of the partial derivative of the hidden layer by the time loss function at time t and the partial derivative of the hidden layer by the time loss function at time t +1, that is:
δ^{(t)} = V^T·(ŷ^{(t)} − y^{(t)}) + W^T·diag(1 − (h^{(t+1)})²)·δ^{(t+1)}    (36)

For the last time step T, the sequence is already in its final stage, so the gradient of its hidden layer is no longer affected by the gradient of the loss at a later moment, and the gradient of the last step can be expressed as:
δ^{(T)} = V^T·(ŷ^{(T)} − y^{(T)})    (37)

Gradient calculation can then be performed for the parameters W, U and b. For W, the gradient is:
∂L/∂W = Σ_{t=1}^{T} diag(1 − (h^{(t)})²)·δ^{(t)}·(h^{(t−1)})^T    (38)
for U, the gradient is:
∂L/∂U = Σ_{t=1}^{T} diag(1 − (h^{(t)})²)·δ^{(t)}·(x^{(t)})^T    (39)
for b, there is a gradient:
∂L/∂b = Σ_{t=1}^{T} diag(1 − (h^{(t)})²)·δ^{(t)}    (40)
by repeating the back propagation process, the parameter values are continuously updated, the purpose of loss function optimization is achieved, the model is finally converged, and a better gesture classification accuracy is achieved.
(6) Analyzing the experimental result;
the experimental development environment of the present invention is shown in tables 6 and 7 below, where table 6 lists the hardware environment of the experimental computer, table 7 lists the experimental development environment including specific contents such as development language and development framework, and table 8 lists the parameters of the model.
TABLE 6 Experimental computer configuration
TABLE 7 Experimental development Environment
TABLE 8 training parameters
The video samples collected under the single viewpoint are used for training, and the single-viewpoint video data are divided according to the 8:1:1 training/validation/test ratio mentioned above. In the invention, the labels of the 5 gesture samples are set as: empty-hand movement, load movement, rotation, grasping and releasing. The model is then trained with the parameters set in Table 8: the initial learning rate is 0.001 and the learning-rate decay rate is 0.94, and back-propagation training is performed with the gradient descent method. As the number of training iterations increases, the parameters of the model come closer to the real situation and the learning rate is decayed, with a minimum learning rate of 0.0001 after decay. The video frames read during training have a size of 408 × 720, and the video length is between 1 and 2 seconds, so the number of frames read each time varies; the value of the loss function is output once each iteration step is completed. As training proceeds, the loss function decreases continuously and the accuracy of the model increases continuously; the model then becomes stable and finally converges.
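For illustration, a short sketch of an exponentially decayed learning rate with the stated initial value 0.001, decay rate 0.94 and floor 0.0001 follows; the decay interval (how many steps between decays) is not given in the text and is an assumed value.

```python
def decayed_learning_rate(step, initial_lr=0.001, decay_rate=0.94,
                          decay_steps=1000, min_lr=0.0001):
    """Exponentially decayed learning rate with the minimum value described above.
    `decay_steps` controls how often the rate is decayed and is an assumption."""
    lr = initial_lr * decay_rate ** (step / decay_steps)
    return max(lr, min_lr)

for step in (0, 5000, 20000, 60000):
    print(step, round(decayed_learning_rate(step), 6))
```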
The invention has the advantages that:
the invention provides a gesture recognition algorithm fusing a recurrent neural network based on a computer vision technology, and the gesture recognition algorithm is used for recognizing the gesture actions of staff in the production process. The outstanding characteristics are that: aiming at the problem that continuous complex actions are difficult to identify through a computer vision technology in actual production, a CPM model is improved, a gesture joint coordinate identification network model is established to obtain gesture joint coordinates of gesture video samples collected under a single viewpoint, the gesture joint coordinates are input into a corrected standard gesture sequence identification network, a gesture action sequence is obtained, and continuous actions are identified.
The embodiments described in this specification are merely illustrative of implementations of the inventive concept and the scope of the present invention should not be considered limited to the specific forms set forth in the embodiments but rather by the equivalents thereof as may occur to those skilled in the art upon consideration of the present inventive concept.

Claims (1)

1. A dynamic gesture action recognition method based on deep learning comprises the following steps:
step one, constructing a gesture joint point coordinate identification network;
the method comprises the following steps of processing a gesture video by using an improved CPM model, and outputting gesture joint point coordinates under a single view point, wherein the implementation process comprises the following steps:
(1.1) selecting a base network model for gesture joint point estimation;
selecting VGG-13 as a basic network model for gesture joint point estimation;
(1.2) setting a receptive field;
the size of the receptive field is related to the sliding window of convolution or pooling, and both are considered as a map, compressing the k × k range of pixel values on the n-layer feature map into one pixel on the n + 1-layer feature map, denoted as fksWhere s represents the step size of the sliding window, k represents the size of the convolution kernel or pooling kernel, and the mapping relationship is:
x_{n+1} = f_{ks}(x_n) (1)
wherein x_n and x_{n+1} denote the feature maps of the n-th and (n+1)-th layers;
the basic network structure is based on VGG-13; the first part of VGG-13 contains two convolutions and one pooling, and these three structures form a cascade, so this mapping process is repeated many times in the network to form a multi-level mapping; the receptive field and convolution/pooling kernel parameters of each link are shown in Table 1:
Table 1. Receptive field and convolution/pooling kernel parameters corresponding to each layer's feature map in the cascade
[Table 1 appears as an image in the original claims; it lists, for each cascaded layer, the receptive field RF_n together with the corresponding kernel size K_n and stride S_n.]
where RF_n is the receptive field of the n-th feature map, K_n is the size of the convolution or pooling kernel of the n-th convolution layer, and S_n is the stride of K_n; the relationship between the receptive field, the stride and the kernel size can be derived according to the receptive-field rule in Table 1;
the receptive field size of the feature map after the first layer of convolution is the size of the convolution kernel:
RF_1 = K_1 (2)
when the stride is 1, the receptive field of the n-th (n ≥ 2) feature map is:
RF_n = RF_{n-1} + (K_n - 1) (3)
for the case where the stride is not 1 (n ≥ 2):
RF_n = RF_{n-1} + (K_n - 1) × S_n (4)
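As an informal illustration of recursions (2)-(4), the short Python sketch below computes the receptive field layer by layer from lists of kernel sizes and strides; the example layer parameters are hypothetical and are not the VGG-13 values of Table 1.

```python
# Sketch of the receptive-field recursion in equations (2)-(4):
# RF_1 = K_1; RF_n = RF_{n-1} + (K_n - 1) for stride 1,
# and RF_n = RF_{n-1} + (K_n - 1) * S_n otherwise.

def receptive_fields(kernels, strides):
    rfs = []
    for n, (k, s) in enumerate(zip(kernels, strides), start=1):
        if n == 1:
            rf = k                          # equation (2)
        elif s == 1:
            rf = rfs[-1] + (k - 1)          # equation (3)
        else:
            rf = rfs[-1] + (k - 1) * s      # equation (4)
        rfs.append(rf)
    return rfs

# Hypothetical cascade: conv3x3/1, conv3x3/1, pool2x2/2 (not the Table 1 values).
print(receptive_fields([3, 3, 2], [1, 1, 2]))  # -> [3, 5, 7]
```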
(1.3) extracting features;
extracting the features of the image by using a basic network model VGG-13;
first, the pixel position coordinate of the p-th joint point in the image is defined as Y_p; then:
Y_p ∈ Z ⊂ R^2 (5)
wherein the set Z represents the positions of all pixels in the image;
assuming P joint points are to be predicted, the goal is to obtain the coordinates Y of all P joint points:
Y = (Y_1, Y_2, …, Y_P) (6)
from the above relationship, Y is a subset of Z;
then a multi-stage prediction classifier g_t(x) is defined to predict the position of each joint point at each stage; at each stage t ∈ {1, 2, …, T}, the prediction classifier assigns a point z in the image to Y_p and generates a heat map for each gesture joint point; the specific expression is:
g_t(x_z) → {b_t^p(Y_p = z)}, p = 1, 2, …, P (7)
when the classifier predicts the gesture joint point location in the first stage, it generates a heat map and corresponding gesture joint point confidence scores:
g_1(x_z) → {b_1^p(Y_p = z)}, p = 1, 2, …, P (8)
wherein b_1^p(Y_p = z) is the confidence score of the classifier when predicting the p-th gesture joint point at position z in the first stage;
for each of the following stages, the confidence score for the p-th gesture joint at z-position may be expressed as:
b_t^p[u, v] = b_t^p(Y_p = z), z = (u, v)^T (9)
wherein u and v represent coordinate values of a certain position z in the image;
in a subsequent stage t (t ≥ 2), a more accurate position coordinate z is continually assigned to each gesture joint point based on the heat maps and joint confidence scores of the previous stage; this more accurate position z is determined from the image features extracted by the classifier in the first stage and the image context information extracted by the classifier in the previous stage; similarly, at each subsequent stage the prediction classifier still generates the heat map of each gesture joint point and the corresponding confidence score:
g_t(x'_z, ψ_t(z, b_{t-1})) → {b_t^p(Y_p = z)}, p = 1, 2, …, P (10)
wherein ψ_t(z, b_{t-1}) represents a mapping between the confidence scores and the image context information, and x'_z represents the image features extracted by the previous stage around position z;
as these processes are repeated, the position of the p-th gesture joint point is corrected at each stage based on the image context information from the previous stage and the image features extracted in the first stage; through this gradual fine-tuning process the model finally estimates a more accurate coordinate position for each gesture joint point;
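The following sketch illustrates this multi-stage refinement idea, assuming PyTorch as the framework; the channel sizes, the number of joints and the number of stages are illustrative assumptions and not the parameters of the improved CPM described here.

```python
# Sketch of multi-stage belief-map refinement in the spirit of a CPM:
# stage 1 predicts P joint heat maps from image features; every later stage
# re-predicts them from the same features concatenated with the previous
# stage's heat maps (the "context"). Channel sizes and stage count are
# illustrative assumptions.
import torch
import torch.nn as nn

class MultiStageHeatmaps(nn.Module):
    def __init__(self, feat_ch=128, num_joints=21, stages=3):
        super().__init__()
        self.stage1 = nn.Sequential(
            nn.Conv2d(feat_ch, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, num_joints, 1))
        self.refine = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(feat_ch + num_joints, 128, 3, padding=1), nn.ReLU(),
                nn.Conv2d(128, num_joints, 1))
            for _ in range(stages - 1)])

    def forward(self, features):
        beliefs = [self.stage1(features)]
        for stage in self.refine:
            prev = beliefs[-1]
            beliefs.append(stage(torch.cat([features, prev], dim=1)))
        return beliefs  # one heat-map tensor per stage

feats = torch.randn(1, 128, 46, 46)          # hypothetical backbone feature map
maps = MultiStageHeatmaps()(feats)
print([m.shape for m in maps])
```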
step two, collecting single-viewpoint video data;
the gesture video samples are collected in a single-view mode, namely, a common network camera is used for capturing gesture data of a user from multiple angles, wherein:
(2.1) defining a basic gesture element;
the basic action elements to be recognized visually are redefined; the determined, specifically recognizable basic action elements are called basic gesture elements, and their signs are defined;
(2.2) selecting a gesture joint point;
the gesture joint points are identified, and the identified joint points are connected and labeled in order to form a hand gesture skeleton; the gesture of the hand is recognized by recognizing the posture of this hand skeleton, a process defined as gesture estimation. When a finger bends, it is usually divided into three small segments that present different degrees of bending, and the connection points between these segments are exactly the joint points of the finger. Therefore, the point at the fingertip of each finger is selected as the initial joint point of that finger, the joint points on the three segments of each finger are then connected, and the last joint point of each finger is connected to a joint point on the wrist. After the joint points of the model are selected, they are labeled and connected in a certain order to form the gesture skeleton;
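One possible labelling of such a gesture skeleton (one wrist point plus four joint points per finger, 21 points in total) is sketched below; the exact indices used by the invention are not given in the text, so this numbering and connectivity are only an assumption.

```python
# Hypothetical labelling of a 21-point hand skeleton: joint 0 is the wrist,
# and each finger contributes four joints from its base to the fingertip.
# The exact index order used by the invention is not specified in the text.
WRIST = 0
FINGERS = {
    "thumb":  [1, 2, 3, 4],
    "index":  [5, 6, 7, 8],
    "middle": [9, 10, 11, 12],
    "ring":   [13, 14, 15, 16],
    "little": [17, 18, 19, 20],
}

def skeleton_edges():
    """Connect the wrist to each finger base, then chain each finger's joints."""
    edges = []
    for joints in FINGERS.values():
        edges.append((WRIST, joints[0]))
        edges += list(zip(joints, joints[1:]))
    return edges

print(len(skeleton_edges()))  # 20 bones for 21 joints
```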
(3) preparing a training sample data set;
the basis of the identification of image or video content based on a convolutional neural network is a standardized data set; therefore, video acquisition is carried out on the basic gesture elements under the single view point so as to establish a basic gesture element database;
meanwhile, an existing large data set is usually divided into a training set, a verification set and a test set; the three subsets have no pairwise intersection, their union is the complete set, and, since they come from the same data set, they are independently and identically distributed. The verification set and the test set are used to test the accuracy of the model, and both are unrelated to the gradient-descent process during model training; however, because the verification results are used to adjust the number of iterations and the learning rate of the model, i.e. the model has a parameter-tuning process, the verification set is considered to participate in model training;
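A minimal sketch of such an 8:1:1 split is shown below; the shuffling seed and the file-name list are assumptions made only for illustration.

```python
# Sketch of the 8:1:1 train/verification/test split: the three subsets are
# disjoint and their union is the whole sample list. Seed and list contents
# are assumptions for illustration.
import random

def split_dataset(samples, seed=0):
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n = len(samples)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(f"clip_{i:03d}.mp4" for i in range(100))
print(len(train), len(val), len(test))  # 80 10 10
```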
step three, outputting a gesture Gaussian heat map and gesture joint point coordinates;
the real values of the gesture joint points are marked in heat-map form, and the model likewise outputs heat maps; the generated gesture joint point heat map takes a certain point of the pixel area where the joint point is located in the image as the center and a specific number of pixel points as the radius, and draws a circular area, dividing out the area where the joint point is located as the probability area where the joint point appears; the color at the center of this area is the deepest, indicating that the probability of the joint point at that position is the largest, and the color then becomes gradually lighter from the center outwards; this image form, peaking at the center with a gradually lighter periphery, is similar to the image of a Gaussian function, so a Gaussian function can be used to generate the heat map of each joint point region; the coordinates of the heat map are in the form (x, y), i.e. a two-dimensional Gaussian function is used:
f(x, y) = A · exp(-((x - x_0)^2 + (y - y_0)^2) / (2σ^2)) (11)
in the formula, x_0 and y_0 represent the real coordinate values of the gesture joint point; x and y represent the coordinate values of the pixel points in the heat-map area of the gesture joint point;
A represents the amplitude of the two-dimensional Gaussian function; σ represents the standard deviation of x and y;
for the size of the probability area of the gesture joint heat map, the probability area is defined as a circular area with a radius of 1; the amplitude A of the two-dimensional Gaussian function is given the value 1 and σ is given the value 1.5, and a distribution image of the two-dimensional Gaussian function is generated;
a heat map in the form of a two-dimensional Gaussian distribution is generated on the basis of the original picture; the heat map generates a Gaussian-distributed probability area based on the center coordinate of the gesture joint point area, where the probability value at the center of the area, i.e. the peak center point of the two-dimensional Gaussian function, is the largest and becomes smaller as it diffuses towards the periphery; in the Gaussian probability region centered on the peak point with the maximum probability value, the sum of the function values of all points is greater than 1, but within the probability region the probabilities of all pixel points at which the gesture joint point may appear should sum to 1; for this reason, the function values of all pixel points in the region are summed, and the function value corresponding to each pixel point is divided by this sum, ensuring that the probabilities of all points sum to 1; the processing is as follows:
P(x, y) = f(x, y) / Σf(x, y) (12)
in the formula, P(x, y) represents the processed probability that the joint point exists at the pixel point; f(x, y) represents the two-dimensional Gaussian function value corresponding to a pixel point in the probability region; Σf(x, y) represents the sum of the function values of all pixel points;
the heat maps generated based on the two-dimensional Gaussian functions are called Gaussian heat maps, and the Gaussian heat maps of all joint points are output at each stage of the model, namely each joint point corresponds to one Gaussian heat map;
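To make equations (11) and (12) concrete, the sketch below generates a normalized Gaussian heat map for a single joint point, reading the text as A = 1 and σ = 1.5; the 46 × 46 heat-map resolution is an assumption.

```python
# Sketch of equations (11) and (12): an A=1, sigma=1.5 two-dimensional
# Gaussian centred on the joint coordinate, normalised so that the
# probabilities inside the map sum to 1. The 46x46 resolution is an assumption.
import numpy as np

def gaussian_heatmap(x0, y0, height=46, width=46, amplitude=1.0, sigma=1.5):
    y, x = np.mgrid[0:height, 0:width]
    f = amplitude * np.exp(-((x - x0) ** 2 + (y - y0) ** 2) / (2 * sigma ** 2))  # eq. (11)
    return f / f.sum()                                                           # eq. (12)

heat = gaussian_heatmap(x0=20.0, y0=12.0)
print(heat.sum(), np.unravel_index(heat.argmax(), heat.shape))  # ~1.0, (12, 20)
```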
step four, constructing a gesture sequence recognition network;
the specific process of the network model construction is as follows:
(4.1) defining an activation function;
because the number of layers of the recurrent neural network involved is not large, the vanishing-gradient problem is relatively minor when the network is not deep, so Tanh is adopted as the activation function of the recurrent neural network;
the Tanh activation function is a hyperbolic tangent function, and the expression of Tanh and its derivatives is as follows:
Tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x}) (13)
Tanh′(x) = 1 - Tanh^2(x) (14)
(4.2) selecting a loss function;
the last layer of the network needs to output the classes of the basic gesture elements; a multi-class Softmax loss function is adopted to calculate the probability that the gesture in the input video belongs to each class, and the model finally outputs the class with the highest probability as the gesture prediction result for the video;
assuming that x is a set of feature vectors input into the Softmax layer by the recurrent neural network, and W and b are parameters of Softmax, the first step of Softmax is to score each category, calculate the score value Logit of each category:
Logit = W^T x + b (15)
next, the score for each category is converted to a respective probability value using Softmax:
Softmax(Logit_i) = e^{Logit_i} / Σ_j e^{Logit_j} (16)
where i represents the i-th gesture class and Logit_i is the score value of the i-th gesture;
The model outputs a probability distribution over the gesture categories; this distribution is the predicted value and is denoted q(x); each gesture also has an actual label, i.e. a real probability distribution, denoted p(x); the Softmax output is evaluated with the cross-entropy loss function, and since cross entropy describes the distance between two probability distributions, it can be defined as:
H(p, q) = -Σ p(x) log q(x) (22)
assuming that p(x) is (A, B, C) and q(x) is (u, v, w), where p(x) is the true value and q(x) is the predicted value, the cross entropy of p(x) represented by q(x) is:
H((A, B, C), (u, v, w)) = -(A log u + B log v + C log w) (23)
when the positions of q(x) and p(x) are interchanged, the cross entropy of the two is different; cross entropy measures information content through probability: the greater the probability of an event, the less information it contains, i.e. the smaller the entropy value; therefore, the closer the predicted probability distribution q(x) is to the real distribution p(x), the smaller their cross entropy, which means the output of the model is closer to the real value and the model's prediction is more accurate;
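The short sketch below ties equations (15), (16) and (22) together: scores are computed from a feature vector, converted to probabilities with Softmax, and compared with a one-hot label via cross entropy; the dimensions and values are illustrative assumptions.

```python
# Sketch of equations (15), (16) and (22): Logit = W^T x + b, the Softmax
# probabilities, and the cross entropy against a one-hot label.
# All dimensions and values are illustrative assumptions.
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())       # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=8)                      # feature vector from the RNN
W = rng.normal(size=(8, 5))                 # 5 basic gesture classes
b = np.zeros(5)

logits = W.T @ x + b                        # equation (15)
q = softmax(logits)                         # equation (16), predicted distribution q(x)
p = np.array([0, 0, 1, 0, 0])               # one-hot real distribution p(x)

cross_entropy = -np.sum(p * np.log(q))      # equation (22)
print(q.round(3), cross_entropy)
```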
(4.3) establishing a model;
in the model, X = (x_1, x_2, x_3, …, x_T) denotes the video frames expanded in time order; these time-series frames serve as the input of the recurrent neural network, the information contained in each frame is the joint coordinate values of the gesture, and the length of the time series is set to T; the hidden states of the first hidden layer are H^{(1)} = (h_1^{(1)}, h_2^{(1)}, …, h_T^{(1)}); then, for the hidden state of the first hidden layer:
h_t^{(1)} = Tanh(U x_t + W h_{t-1}^{(1)} + b) (24)
wherein, the hidden state of the first sequence of the first hidden layer is:
h_1^{(1)} = Tanh(U x_1 + b) (25)
for the second hidden layer, its input is determined by its own hidden state at the previous time step and by the hidden state of the first hidden layer at the current time step; the hidden state of the second hidden layer can be expressed as:
h_t^{(2)} = Tanh(U h_t^{(1)} + W h_{t-1}^{(2)} + b) (26)
wherein, the hidden state of the first sequence of the second hidden layer is:
h_1^{(2)} = Tanh(U h_1^{(1)} + b) (27)
the final output is the predicted classification result for each gesture, Y = (Y_1, Y_2, Y_3, Y_4, …, Y_n), where:
Y_i = Softmax(V h_T + c) (28)
where i = 1, 2, 3, 4, …, n; U, W and V are parameter matrices used to perform matrix transformations on the input and the hidden states of the hidden layers, b and c are biases, and all parameters are shared at the various stages of the network;
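A minimal NumPy sketch of recurrences (24)-(28) is given below; the use of separate parameter matrices for the two hidden layers, the layer sizes and the random initialisation are assumptions made only for illustration.

```python
# Sketch of the two-hidden-layer recurrence in equations (24)-(28):
# h_t^(1) = tanh(U x_t + W h_{t-1}^(1) + b), h_t^(2) = tanh(U h_t^(1) + W h_{t-1}^(2) + b),
# Y = Softmax(V h_T + c). Separate matrices per layer, the layer sizes and the
# random initialisation are assumptions for illustration only.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(X, params):
    U1, W1, b1, U2, W2, b2, V, c = params
    h1 = np.zeros(W1.shape[0])
    h2 = np.zeros(W2.shape[0])
    for x_t in X:                                   # parameters shared over all time steps
        h1 = np.tanh(U1 @ x_t + W1 @ h1 + b1)       # equations (24)/(25)
        h2 = np.tanh(U2 @ h1 + W2 @ h2 + b2)        # equations (26)/(27)
    return softmax(V @ h2 + c)                      # equation (28)

rng = np.random.default_rng(0)
D, H, C, T = 42, 64, 5, 30                          # assumed: 21 joints x 2 coords, hidden size, classes, frames
params = (rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H)), np.zeros(H),
          rng.normal(0, 0.1, (H, H)), rng.normal(0, 0.1, (H, H)), np.zeros(H),
          rng.normal(0, 0.1, (C, H)), np.zeros(C))
probs = rnn_forward(rng.normal(size=(T, D)), params)
print(probs.round(3), probs.argmax())
```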
and finally, inputting the joint point coordinates obtained in the third step into a standard gesture sequence recognition network to obtain a gesture action sequence.
CN202010011805.1A 2020-01-06 2020-01-06 Dynamic gesture action recognition method based on deep learning Active CN111209861B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010011805.1A CN111209861B (en) 2020-01-06 2020-01-06 Dynamic gesture action recognition method based on deep learning

Publications (2)

Publication Number Publication Date
CN111209861A true CN111209861A (en) 2020-05-29
CN111209861B CN111209861B (en) 2022-03-18

Family

ID=70789567

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010011805.1A Active CN111209861B (en) 2020-01-06 2020-01-06 Dynamic gesture action recognition method based on deep learning

Country Status (1)

Country Link
CN (1) CN111209861B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106991372A (en) * 2017-03-02 2017-07-28 北京工业大学 A kind of dynamic gesture identification method based on interacting depth learning model
WO2019006473A1 (en) * 2017-06-30 2019-01-03 The Johns Hopkins University Systems and method for action recognition using micro-doppler signatures and recurrent neural networks
CN110287844A (en) * 2019-06-19 2019-09-27 北京工业大学 Traffic police's gesture identification method based on convolution posture machine and long memory network in short-term
CN110458046A (en) * 2019-07-23 2019-11-15 南京邮电大学 A kind of human body motion track analysis method extracted based on artis

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YUEH WU等: "Applying hand gesture recognition and joint tracking to a TV controller using CNN and Convolutional Pose Machine", 《2018 24TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR)》 *
卢兴沄: "一种类人机器人手势识别算法及其实现", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950341A (en) * 2020-06-19 2020-11-17 南京邮电大学 Real-time gesture recognition method and gesture recognition system based on machine vision
CN113196289A (en) * 2020-07-02 2021-07-30 浙江大学 Human body action recognition method, human body action recognition system and device
CN112102451A (en) * 2020-07-28 2020-12-18 北京云舶在线科技有限公司 Common camera-based wearable virtual live broadcast method and equipment
CN112102451B (en) * 2020-07-28 2023-08-22 北京云舶在线科技有限公司 Wearable virtual live broadcast method and equipment based on common camera
CN111881994B (en) * 2020-08-03 2024-04-05 杭州睿琪软件有限公司 Identification processing method and apparatus, and non-transitory computer readable storage medium
CN111881994A (en) * 2020-08-03 2020-11-03 杭州睿琪软件有限公司 Recognition processing method and apparatus, and non-transitory computer-readable storage medium
CN112699837A (en) * 2021-01-13 2021-04-23 新大陆数字技术股份有限公司 Gesture recognition method and device based on deep learning
CN112862096A (en) * 2021-02-04 2021-05-28 百果园技术(新加坡)有限公司 Model training and data processing method, device, equipment and medium
CN113313161A (en) * 2021-05-24 2021-08-27 北京大学 Object shape classification method based on rotation invariant canonical invariant network model
CN113313161B (en) * 2021-05-24 2023-09-26 北京大学 Object shape classification method based on rotation-invariant standard isomorphism network model
CN113269089A (en) * 2021-05-25 2021-08-17 上海人工智能研究院有限公司 Real-time gesture recognition method and system based on deep learning
CN113269089B (en) * 2021-05-25 2023-07-18 上海人工智能研究院有限公司 Real-time gesture recognition method and system based on deep learning
TWI787841B (en) * 2021-05-27 2022-12-21 中強光電股份有限公司 Image recognition method
CN113743247A (en) * 2021-08-16 2021-12-03 电子科技大学 Gesture recognition method based on Reders model
US20230107097A1 (en) * 2021-10-06 2023-04-06 Fotonation Limited Method for identifying a gesture
US11983327B2 (en) * 2021-10-06 2024-05-14 Fotonation Limited Method for identifying a gesture
CN114185429B (en) * 2021-11-11 2024-03-26 杭州易现先进科技有限公司 Gesture key point positioning or gesture estimating method, electronic device and storage medium
CN114185429A (en) * 2021-11-11 2022-03-15 杭州易现先进科技有限公司 Method for positioning gesture key points or estimating gesture, electronic device and storage medium
CN114499712A (en) * 2021-12-22 2022-05-13 天翼云科技有限公司 Gesture recognition method, device and storage medium
CN114499712B (en) * 2021-12-22 2024-01-05 天翼云科技有限公司 Gesture recognition method, device and storage medium
CN115273244B (en) * 2022-09-29 2022-12-20 合肥工业大学 Human body action recognition method and system based on graph neural network
CN115273244A (en) * 2022-09-29 2022-11-01 合肥工业大学 Human body action recognition method and system based on graph neural network
CN116645727B (en) * 2023-05-31 2023-12-01 江苏中科优胜科技有限公司 Behavior capturing and identifying method based on Openphase model algorithm
CN116645727A (en) * 2023-05-31 2023-08-25 江苏中科优胜科技有限公司 Behavior capturing and identifying method based on Openphase model algorithm
CN116974369A (en) * 2023-06-21 2023-10-31 广东工业大学 Method, system, equipment and storage medium for operating medical image in operation
CN116974369B (en) * 2023-06-21 2024-05-17 广东工业大学 Method, system, equipment and storage medium for operating medical image in operation
CN116959120A (en) * 2023-09-15 2023-10-27 中南民族大学 Hand gesture estimation method and system based on hand joints
CN116959120B (en) * 2023-09-15 2023-12-01 中南民族大学 Hand gesture estimation method and system based on hand joints

Also Published As

Publication number Publication date
CN111209861B (en) 2022-03-18

Similar Documents

Publication Publication Date Title
CN111209861B (en) Dynamic gesture action recognition method based on deep learning
CN111191627B (en) Method for improving accuracy of dynamic gesture motion recognition under multiple viewpoints
CN105975931B (en) A kind of convolutional neural networks face identification method based on multiple dimensioned pond
Lim et al. Isolated sign language recognition using convolutional neural network hand modelling and hand energy image
Amor et al. Action recognition using rate-invariant analysis of skeletal shape trajectories
Chaudhary et al. Intelligent approaches to interact with machines using hand gesture recognition in natural way: a survey
CN111695457B (en) Human body posture estimation method based on weak supervision mechanism
CN110458046B (en) Human motion trajectory analysis method based on joint point extraction
CN113221663B (en) Real-time sign language intelligent identification method, device and system
CN113128424B (en) Method for identifying action of graph convolution neural network based on attention mechanism
CN112101262B (en) Multi-feature fusion sign language recognition method and network model
EP4099213A1 (en) A method for training a convolutional neural network to deliver an identifier of a person visible on an image, using a graph convolutional neural network
CN112800990B (en) Real-time human body action recognition and counting method
CN113191243B (en) Human hand three-dimensional attitude estimation model establishment method based on camera distance and application thereof
CN111709268A (en) Human hand posture estimation method and device based on human hand structure guidance in depth image
CN112906520A (en) Gesture coding-based action recognition method and device
CN110163130B (en) Feature pre-alignment random forest classification system and method for gesture recognition
Kowdiki et al. Adaptive hough transform with optimized deep learning followed by dynamic time warping for hand gesture recognition
Ikram et al. Real time hand gesture recognition using leap motion controller based on CNN-SVM architechture
CN111274901B (en) Gesture depth image continuous detection method based on depth gating recursion unit
Memmesheimer et al. Gesture recognition on human pose features of single images
Postnikov et al. Conditioned human trajectory prediction using iterative attention blocks
CN114898464A (en) Lightweight accurate finger language intelligent algorithm identification method based on machine vision
CN114202801A (en) Gesture recognition method based on attention-guided airspace map convolution simple cycle unit
CN114973305A (en) Accurate human body analysis method for crowded people

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant