CN112183198A - Gesture recognition method for fusing body skeleton and head and hand part profiles - Google Patents

Gesture recognition method for fusing body skeleton and head and hand part profiles

Info

Publication number
CN112183198A
Authority
CN
China
Prior art keywords
skeleton
gesture
human body
human
network
Prior art date
Legal status
Pending
Application number
CN202010851927.1A
Other languages
Chinese (zh)
Inventor
何坚
廖俊杰
张丞
余立
Current Assignee
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN202010851927.1A
Publication of CN112183198A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06V 40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/107 Static hand or arm
    • G06V 40/113 Recognition of static hand signs
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent


Abstract

A gesture recognition method that fuses the body skeleton with the contours of the head and hand parts, belonging to the field of computer vision. The invention describes human gestures by combining human skeleton posture features with gesture part contour features, and uses the part categories identified by a contour detection network to represent local information of the human body, making the human body model more complete. The invention prunes a CPM network to construct a human skeleton key node recognition network KEN with sufficient real-time performance: in practical tests it reaches a recognition speed of 15 frames per second while maintaining high recognition accuracy. The method is characterized by a small number of parameters, fast operation and high recognition accuracy.

Description

Gesture recognition method for fusing body skeleton and head and hand part profiles
Technical Field
The invention relates to a gesture recognition method that fuses the limb skeleton with the contours of the head and hand parts. It belongs to the field of electronic information and is a computer-vision-based human gesture recognition method applicable to human-computer interaction.
Background
Gestures are the most important form of non-verbal communication between people. Because gestures are natural and take many forms, gesture recognition is an important area of human-computer interaction research. Recognition methods can be divided into contact-based gesture recognition and vision-based gesture recognition according to whether the sensing device touches the body. The devices used for contact-based recognition (such as data gloves) are complex and expensive, and users must become familiar with them before their gestures can be recognized, which limits the natural expression of gestures and hinders natural interaction. Vision-based gesture recognition requires no expensive equipment, is convenient and natural, better fits the trend toward natural human-computer interaction, and has broad application prospects. Computer-vision-based methods are easy to implement, but their recognition accuracy is easily affected by background, illumination and changes in gesture motion. In recent years, deep learning algorithms have achieved excellent results in image recognition, natural language processing and other fields, providing a new way to implement human gesture recognition.
To address the problems of computer-vision-based human gesture recognition, the invention introduces the deep-learning-based Convolutional Pose Machine (CPM), Single Shot MultiBox Detector (SSD) and Long Short-Term Memory (LSTM) for human gesture recognition.
Disclosure of Invention
To address the problem that computer-vision-based human gesture recognition is easily affected by illumination, background and dynamic gesture changes, the invention combines CPM, SSD and LSTM to construct a human dynamic gesture recognizer (Gesture Recognizer based on Spatial Context and Temporal Feature Fusion, GRSCTFF) that extracts the spatio-temporal features of human gestures and thereby recognizes them quickly and accurately.
the invention specifically comprises the following steps:
(1) On the basis of analyzing the spatial context features of human gestures, establishing a dynamic gesture model based on human skeleton and part contour features;
When gesture interaction is used, the form of a dynamic gesture consists mainly of the human skeleton configuration and the contours of the hands and head; in essence, a gesture is the combination of the relative positions of the skeleton joint points, the individual skeleton segments (e.g. their lengths and angles) and the external shapes of the hands and head. The human skeleton is formed by linking skeleton key nodes to one another. In the invention, "human skeleton key nodes" are the key nodes contained in the human skeleton link structure, and "parts" are the hands, head and feet, which have shape and contour characteristics. Drawing on the idea of 3-dimensional human body models, a universal gesture model is established that fuses the contour features of the human skeleton, hands, head and other parts.
(2) Constructing a deep neural network using the convolutional pose machine and single-shot multibox detector techniques to extract the human gesture skeleton and part contour features, and combining them into the human spatial context features;
Gesture spatial context information consists of the gesture skeleton configuration and the gesture part contours. The gesture skeleton configuration comprises the relative length features of the human skeleton and its angle features relative to the direction of gravitational acceleration. To extract these features, a human skeleton key node extraction network must be established; drawing on the CPM idea and pruning the CPM depth, the invention constructs a human skeleton key node extraction network KEN comprising 3 stages:
setting Z as the set of all position coordinates (i, j) of the human skeleton in the image; using Y as the position of each key node of human skeleton in imagekIndicating that the human skeleton contains a total of 14 key nodes, hence Yk∈{Y1,…,Y14}. KEN is composed of a series of multi-class predictors gtCompositions trained to predict the location of each key node in the same image under different receptive fields. Specifically, gt(. cndot.) is a classifier, and the subscript T ∈ {1, …, T } indicates the stages of classification, each stage having a different receptive field, where T is the last stage of the classifier. gt(. DEG) predicting that the point Z (Z epsilon Z) in the image under the receptive field belongs to a key nodeYkConfidence of (1), using b (Y)kZ) (Z ∈ Z) represents a confidence value, then bT(YkZ) represents the key node confidence for the z-coordinate point currently in the T-phase. These gt(. cndot.) has the same objective function value (i.e., true confidence). When t is>1 time, gt(. is a feature value x extracted from an image position z)zAnd each key node YkAnd (4) splicing functions of the predicted values of the confidence degrees at the moment t-1. After T stages, the position with the highest confidence coefficient is the position of the key node, and argmax represents the key node YkAnd when the confidence coefficient is maximum, acquiring a function of the coordinate point z. Namely:
Yk=argmax(bT(Yk=z)),k∈{1…14} (1)
based on the formula (1), the position of each key node in the human skeleton can be calculated, and a preliminary human skeleton form is established.
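For illustration only, a minimal NumPy sketch of equation (1) could look as follows; the array name, shape and the assumption that the final stage outputs one confidence map per key node are illustrative and not part of the claimed method:

```python
import numpy as np

def extract_key_nodes(confidence_maps: np.ndarray) -> np.ndarray:
    """Apply equation (1): for each of the 14 key nodes, pick the image
    coordinate z with the highest final-stage confidence b_T(Y_k = z).

    confidence_maps: array of shape (14, H, W) holding b_T for every node.
    Returns an array of shape (14, 2) with (row, col) coordinates.
    """
    num_nodes, h, w = confidence_maps.shape
    coords = np.zeros((num_nodes, 2), dtype=np.int64)
    for k in range(num_nodes):
        flat_idx = np.argmax(confidence_maps[k])        # argmax over all positions z
        coords[k] = np.unravel_index(flat_idx, (h, w))  # back to (i, j)
    return coords
```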
Fig. 1 shows the gesture skeleton feature extraction process. φ_1(·) denotes the function that transforms the human skeleton key nodes into limb vectors. In addition, since the height of the human head is fixed, it does not change with body rotation or with the moving distance of the camera. The invention therefore takes the head height as a reference point and introduces a function φ_2(·) denoting the vector concatenation of the relative visible lengths of the skeleton segments contained in the human skeleton, i.e. equation (2):

V_l = φ_2(V) = (|v_1| / |V_head|) ⊕ (|v_2| / |V_head|) ⊕ … ⊕ (|v_11| / |V_head|)    (2)

where 11 indicates that the human skeleton model of the invention contains 11 limb skeleton segments in total, v_i is the i-th skeleton segment of the human gesture, V_head is the head skeleton vector from the vertex to the centre of the neck, | · | denotes the vector modulus (i.e. the length of the head skeleton), and ⊕ denotes vector concatenation. The formula takes V_head as the reference and divides the length of each limb skeleton segment by the modulus of V_head to obtain the visible length of each segment relative to the head skeleton.
In addition, because the direction of gravitational acceleration is always perpendicular to the ground, in order to describe the direction of each skeleton segment relative to the ground the invention introduces the angle between each segment and the gravitational acceleration, and uses φ_3(·) to denote the vector concatenation of the angles between each skeleton segment and the gravity direction, i.e. equation (3):

V_a = φ_3(V) = (cos⟨v_1, d⟩, sin⟨v_1, d⟩) ⊕ … ⊕ (cos⟨v_11, d⟩, sin⟨v_11, d⟩)    (3)

The invention describes the angle feature of a skeleton segment by the trigonometric function values of the angle between the segment and the gravity direction. In equation (3), d denotes a unit vector pointing in the gravity direction; cos⟨v_i, d⟩ = (v_i · d) / (|v_i| |d|) is the cosine of the angle between each skeleton vector and the gravity direction, and sin⟨v_i, d⟩ is the corresponding sine value. Through these steps the 2 spatial context features contained in the human skeleton are extracted, namely the relative visible lengths V_l of the skeleton segments and the angles V_a between the segments and the gravity direction. The shape feature of the human gesture skeleton is denoted bone = V_l ∪ V_a.
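As a sketch of equations (2) and (3), the skeleton features could be assembled as below, assuming each segment v_i is given as a 2-D vector, that the image y-axis points downward so the gravity direction is d = (0, 1), and that the sine is recovered from the cosine by magnitude; these conventions are assumptions for the example:

```python
import numpy as np

def skeleton_features(segments: np.ndarray, head_vec: np.ndarray) -> np.ndarray:
    """Build the skeleton part (bone = V_l ∪ V_a) of the spatial context feature.

    segments: (11, 2) array, one 2-D vector per limb skeleton segment v_i.
    head_vec: 2-D vector V_head from the vertex to the centre of the neck.
    Returns the concatenation of V_l (relative visible lengths, eq. (2))
    and V_a (cos/sin of the angle to the gravity direction, eq. (3)).
    """
    d = np.array([0.0, 1.0])                              # assumed unit vector along gravity
    head_len = np.linalg.norm(head_vec)
    seg_len = np.linalg.norm(segments, axis=1)

    v_l = seg_len / head_len                              # eq. (2): |v_i| / |V_head|
    cos_a = segments @ d / np.maximum(seg_len, 1e-8)      # eq. (3): cosine to gravity direction
    sin_a = np.sqrt(np.clip(1.0 - cos_a**2, 0.0, 1.0))    # sine magnitude recovered from cosine
    return np.concatenate([v_l, cos_a, sin_a])
```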
The gesture part contours comprise the category features of the human head and hands. To obtain the part contour categories, the invention draws on the SSD idea and replaces the feature extraction network VGGNet in SSD with MobileNet, which has fewer parameters, to construct the gesture part contour feature extraction network GPEN:
GPEN detects and classifies the part contour features on multi-scale convolutional feature maps. For each cell of each scale's feature map, GPEN uses anchor boxes with different scales and aspect ratios to predict the gesture part within the box, generating the position of the prediction box and the category confidences of the different part contours. Let S be the set of part contour feature values (L, C) recognized by GPEN from the image, where L denotes the position information of a part contour prediction box, consisting of the coordinates of its centre point together with its width and height, and C denotes the set of confidences for classifying the object contour contained in the prediction box into the different part contour categories. For example, c_i denotes the confidence that a part contour belongs to the i-th part contour class, i.e. c_i ∈ C.
For each part contour p (s_p ∈ S), the position information is l_p and the class confidence set is C_p. Suppose the category corresponding to the largest confidence value in C_p is m, and M is the complete set of gesture part contour categories; then the category of p is set to m (m ∈ M) with confidence value c_m (c_m ∈ C_p), and the feature value of s_p is (l_m, c_m). By analogy, the set of feature values for all part contours in an image is S = (L_m, C_m). According to a preset confidence threshold c_th (set to 0.5 in the invention; below this value the object is not considered a part contour), the elements whose c_m is lower than c_th are removed from S, and the remaining elements of S are sorted in descending order of confidence value to form the final part contour set G. The following 3 steps are then repeated:
1) Take the part with the highest confidence value c_m in G and compute, with each of the other parts in G, equation (4), where J(l_m, l_other) denotes the degree of overlap between this part's contour and the other part's contour, l_m is the position feature of this part's contour and l_other is the position feature of the other part's contour:

J(l_m, l_other) = |l_m ∩ l_other| / |l_m ∪ l_other|    (4)

2) The overlap threshold for recognizing the same part contour is J_th, set to 0.5 in the invention, i.e. an overlap coverage above 50% is regarded as the same part contour; therefore, when J(l_m, l_other) is higher than J_th, the part feature s_other corresponding to l_other is deleted from G.
3) When the above operations on the sorted part set G are complete, the part feature s_m corresponding to l_m is deleted from G and the (l_m, c_m) value corresponding to s_m is output. The category to which m belongs determines whether the part contour is a left-hand contour feature S_left (or a right-hand contour feature S_right, or a head contour feature S_head).
These steps are repeated until the set G is empty, finally yielding the left-hand part contour feature S_left, the right-hand part contour feature S_right and the head part contour feature S_head. On this basis, equation (5) splices the human skeleton feature bone of the gesture with the left-hand part contour feature S_left, the right-hand part contour feature S_right and the head part contour feature S_head to form the spatial context feature F of the gesture. Namely:

F = φ_4(bone, S_left, S_right, S_head)    (5)
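The de-duplication loop above is essentially non-maximum suppression over the detected part contours. A sketch under the assumption that each detection is a (box, class, confidence) tuple with the box encoded as (cx, cy, w, h); the data layout and function names are illustrative:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]          # (cx, cy, w, h)
Detection = Tuple[Box, int, float]               # (box, class index m, confidence c_m)

def iou(a: Box, b: Box) -> float:
    """Equation (4): overlap J(l_m, l_other) as intersection over union of the two boxes."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def select_part_contours(dets: List[Detection], c_th: float = 0.5, j_th: float = 0.5) -> List[Detection]:
    """Keep detections above the confidence threshold c_th, then repeatedly take the most
    confident one and drop any remaining detection overlapping it by more than j_th."""
    g = sorted([d for d in dets if d[2] >= c_th], key=lambda d: d[2], reverse=True)
    kept: List[Detection] = []
    while g:
        best = g.pop(0)                          # highest-confidence contour in G
        kept.append(best)
        g = [d for d in g if iou(best[0], d[0]) <= j_th]
    return kept
```

The surviving detections can then be split by class into S_left, S_right and S_head and concatenated with bone as in equation (5).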
(3) Introducing a long short-term memory network to extract the temporal features of the skeleton and of the left-hand, right-hand and head contours in dynamic human gestures, fusing the human spatial context features, further classifying and recognizing the gestures, and completing the construction of GRSCTFF;
In dynamic gesture recognition, the gesture category is related not only to the current gesture features but also to the previous gesture features. Let f_cls be the gesture classification function and classification the recognized human gesture category; F_0 denotes the spatial context feature of the body at time 0, F_1 the spatial context feature at time 1, and F_τ the human spatial context feature at time τ. The gesture category at the current time is then obtained from equation (6):

classification = f_cls(F_0, F_1, …, F_τ)    (6)

Equation (6) shows that, to accurately identify the current dynamic gesture category, a structure is needed that preserves the spatial context features of previous gestures. The invention therefore introduces an LSTM network to associate the spatial features in a dynamic gesture with their temporal order and ultimately complete dynamic gesture classification.
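A PyTorch sketch of equation (6) is given below, assuming the per-frame spatial context feature F_τ has already been assembled into a fixed-length vector; the feature size, hidden size and number of gesture classes are placeholders, not values from the patent:

```python
import torch
import torch.nn as nn

class GestureClassifier(nn.Module):
    """LSTM that consumes the sequence F_0, F_1, ..., F_τ of spatial context
    features and predicts the gesture class at the current time step."""
    def __init__(self, feature_dim: int = 36, hidden_dim: int = 128, num_classes: int = 8):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, sequence length τ+1, feature_dim)
        out, _ = self.lstm(features)             # out: (batch, seq, hidden_dim)
        return self.head(out[:, -1])             # class logits for the current time step
```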
The main features of the invention are:
(1) The invention proposes a method that describes human gestures by combining human skeleton posture features with gesture part contour features. Representing gestures with skeleton posture features alone easily leads to misjudging the gesture category when the body is partially occluded; representing them with part contours alone relies excessively on apparent features of the body parts, such as colour, texture and edge information, and is therefore limited. As shown in equation (5) in the disclosure, the technical difficulty of the invention lies in representing the overall human body model by splicing the human skeleton features with the gesture part contour features. The skeleton features are computed by equation (3), and the invention creatively proposes using the lengths normalized by the head-to-neck segment and the angles of the limbs relative to the direction of gravitational acceleration as the skeleton features. The gesture part contour features are obtained via equation (4), and the part categories identified by the contour detection network are used to represent local information of the human body, making the human body model structure more complete.
(2) The invention prunes the CPM network to construct the human skeleton key node recognition network KEN, giving it sufficient real-time performance: in practical tests it reaches a recognition speed of 15 frames per second while maintaining high recognition accuracy. The network model does not depend on specific human spatial constraints and breaks the limitation of traditional human gesture recognition networks that rely on the apparent features of body parts. In addition, the gesture part contour feature extraction network GPEN proposed by the invention adjusts the anchor box sizes according to the part contours in the data set, so that the anchor boxes have sufficient detection capability for smaller contours such as the hands and head; the features extracted by the part contour network therefore supplement the description of the body well when part of the limb skeleton is occluded, while the network model has few parameters and a high detection speed. Finally, the LSTM network is adopted to extract the temporal features of dynamic human gestures, building the temporal associations within a dynamic gesture and avoiding the misjudgement of continuous dynamic gestures that arises from treating individual poses independently.
(3) The invention designs and implements the human dynamic gesture recognizer GRSCTFF, which can accurately recognize the category of dynamic human gestures in a variety of complex scenes, solving the problem that computer-vision-based gesture recognition is easily affected by illumination, background and dynamic gesture changes. The GRSCTFF network model has few parameters, runs fast and achieves high recognition accuracy.
Drawings
FIG. 1 is a gesture skeletal feature extraction process;
FIG. 2(a) is a human gesture;
FIG. 2(b) is a diagram of a human gesture corresponding to a skeletal key node and a skeletal vector;
FIG. 2(c) is a human gesture corresponding to a gesture component outline;
FIG. 3 is a KEN network architecture;
figure 4 is a GPEN network architecture;
FIG. 5 is a convolution process of a depth separable convolution;
FIG. 6 is a scatter plot of feature size ratios;
Fig. 7 is an LSTM network architecture.
Detailed Description
The invention adopts the following technical scheme and implementation steps:
(1) On the basis of analyzing the spatial context features of human gestures, establishing a dynamic gesture model based on human skeleton and part contour features;
FIG. 2(a) shows a generic human gesture model. To recognize the gesture, the human skeleton (FIG. 2(b)) and the contour features of the head and the left and right hands (FIG. 2(c)) must be recognized.
The human skeleton can be abstracted as 14 key nodes and the lines connecting them, as shown in FIG. 2(a). The coordinate set of these key nodes in FIG. 2(b) is Y, where Y_1 denotes human key node No. 1 and the remaining key nodes are numbered analogously, so Y = (Y_1, Y_2, …, Y_14). V denotes the set of connection dependencies that exist between adjacent key nodes in Y, i.e. the human limb skeleton, which, as shown in FIG. 2(b), consists of 3 parts: the head skeleton V_head, the upper-body skeleton V_upper and the lower-body skeleton V_lower. Namely:

V = V_head ∪ V_upper ∪ V_lower    (7)

If v is a key node connection (i.e. v ∈ V) whose starting and ending key nodes are Y_a and Y_b respectively, then the vector v from Y_a to Y_b denotes a skeleton vector contained in the human skeleton. Similarly to the key node classification, the human gesture parts mainly comprise the head and the hands, where the hands comprise a left hand and a right hand, as shown in FIG. 2(c). The human gesture model is completely described by fusing the gesture part contours shown in FIG. 2 with the human skeleton posture.
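A small sketch of this skeleton model is shown below; the (start, end) key-node index pairs are hypothetical and stand in for the actual pairing of FIG. 2(b), which is not recoverable from the text:

```python
import numpy as np

# Hypothetical (start, end) key-node index pairs; the real pairing follows FIG. 2(b).
SKELETON_EDGES = [(0, 1),                                    # head: vertex -> neck (V_head)
                  (1, 2), (2, 3), (3, 4),                    # upper body (V_upper), illustrative
                  (1, 5), (5, 6), (6, 7),
                  (8, 9), (9, 10), (11, 12), (12, 13)]       # lower body (V_lower), illustrative

def skeleton_vectors(key_nodes: np.ndarray) -> np.ndarray:
    """key_nodes: (14, 2) array of key-node coordinates Y_1..Y_14.
    Returns one vector per skeleton segment, i.e. the vector from Y_a to Y_b."""
    return np.stack([key_nodes[b] - key_nodes[a] for a, b in SKELETON_EDGES])
```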
(2) The spatial context feature extraction module is designed and implemented as follows: a deep neural network is constructed using the convolutional pose machine and single-shot multibox detector techniques to extract the human gesture skeleton and part contour features, which are combined into the human spatial context features.
The spatial context feature extraction module mainly comprises two parts: the design and implementation of the human skeleton key node recognition network KEN, and the design and implementation of the gesture part contour feature extraction network GPEN, as follows:
1) design and realization of human skeleton key node identification network KEN:
the conventional CPM outputs 15 hot spot maps while monitoring human activities. Wherein, 14 hot spot graphs correspond to corresponding key nodes of a human body, and the other 1 hot spot graph is a background hot spot graph. The invention uses the CPM idea for reference and supplements the incidence relation among key nodes of the human skeleton at the output end. Meanwhile, in order to support gesture real-time recognition and cut the depth of the CPM, a human body key node extraction network KEN comprising 3 stages is constructed, and FIG. 3 is a network architecture thereof.
In fig. 3, C denotes a convolution layer, K denotes a convolution kernel size, OC denotes the number of output channels, and t denotes a stage. The KEN adopts the first 10 layers of the VGG-19 network as an image feature extraction network to process an input image, and the sizes of convolution kernels from the 1 st layer to the 10 th layer are respectively as follows: 3 × 3 × 64, 3 × 03 × 164, 3 × 23 × 3128, 3 × 43 × 5128, 3 × 63 × 7512, 3 × 83 × 9512, 3 × 3 × 0256, 3 × 3 × 256, and the convolution layers of the 2 nd layer, the 4 th layer, and the 6 th layer are sequentially subjected to maximum pooling, the convolution kernel size of all the maximum pooling is 2 × 2 and the step size is 2, and finally the characteristic x of the image is obtained by the above-mentioned processingz. The KEN tailors the CPM network depth and implements a classifier comprising 3 stages. Wherein Z is the set of all position coordinates of the human skeleton in the image, and when t is 1, Stage1 uses the image characteristic xzAs input, a classifier g of stage1 is implemented1(·),g1(. 2) including 2 branches, the convolutional layer parameter of 2 branches is identical, every branch all contains 5 convolutional layers, in which the convolutional kernel size of every layer is respectively 3X 128, 3X 128, 1X 512, 1X 8 according to the precedence order, the first branch outputs joint point YkConfidence set b of1(YkZ) (Z ∈ Z), the second branch outputs a set of skeletons L1; when t is 2, stage 2 takes the image feature xzAnd joint point Y output in stage1kConfidence set b1(YkZ) (Z ∈ Z) and a set of skeletons L1 as inputs, implementing stage 2 classifier g2(·),g2(. 2) including 2 branches, the convolutional layer parameter of 2 branches is identical, and each branch all contains 7 convolutional layers, and the convolutional kernel size of each layer is respectively 7 × 7 × 128, 7 × 07 × 1128, 7 × 27 × 3128, 7 × 7 × 128, 1 × 1 × 8 according to the precedence order, and the first branch outputs joint point YkConfidence set b of2(YkZ) (Z ∈ Z), the second branch outputs a set of skeletons L2; when t is 3, stage 3 takes the imageCharacteristic xzAnd joint point Y output in stage 2kConfidence set b2(YkZ) (Z ∈ Z) and a skeleton set L2 as inputs, implementing stage 3 classifier g3(·),g3(g) a build-up layer structure and2(. o) complete coincidence, the first branch outputting the joint point YkConfidence set b of3(YkZ) (Z ∈ Z), the second branch outputs a skeleton set L3.
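The cascade structure of KEN could be sketched in PyTorch as follows. This is only an outline of the staging and feature-concatenation scheme: the backbone is left generic, both branches are given identical 7 × 7 layers, and the number of output maps (14 key nodes plus background) is an assumption; the true layer counts and kernel sizes are those listed above.

```python
import torch
import torch.nn as nn

class KENStage(nn.Module):
    """One refinement stage g_t with two branches: belief maps b_t and skeleton maps L_t."""
    def __init__(self, in_ch: int, mid_ch: int = 128, num_maps: int = 15):
        super().__init__()
        def branch() -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(in_ch, mid_ch, 7, padding=3), nn.ReLU(inplace=True),
                nn.Conv2d(mid_ch, mid_ch, 7, padding=3), nn.ReLU(inplace=True),
                nn.Conv2d(mid_ch, mid_ch, 7, padding=3), nn.ReLU(inplace=True),
                nn.Conv2d(mid_ch, 512, 1), nn.ReLU(inplace=True),
                nn.Conv2d(512, num_maps, 1))
        self.belief_branch = branch()      # outputs b_t(Y_k = z)
        self.skeleton_branch = branch()    # outputs the skeleton set L_t

    def forward(self, x):
        return self.belief_branch(x), self.skeleton_branch(x)

class KEN(nn.Module):
    """Three-stage cascade: stage 1 sees only the image feature x_z; stages 2 and 3
    also see the previous stage's belief and skeleton maps."""
    def __init__(self, backbone: nn.Module, feat_ch: int = 256, num_maps: int = 15):
        super().__init__()
        self.backbone = backbone                         # e.g. the pruned VGG-19 front end
        self.stage1 = KENStage(feat_ch, num_maps=num_maps)
        self.stage2 = KENStage(feat_ch + 2 * num_maps, num_maps=num_maps)
        self.stage3 = KENStage(feat_ch + 2 * num_maps, num_maps=num_maps)

    def forward(self, image):
        x = self.backbone(image)
        b1, l1 = self.stage1(x)
        b2, l2 = self.stage2(torch.cat([x, b1, l1], dim=1))
        b3, l3 = self.stage3(torch.cat([x, b2, l2], dim=1))
        return (b1, l1), (b2, l2), (b3, l3)
```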
KEN contains 3 cost functions, which compute the Euclidean distance between the joint point confidence sets b_1(Y_k = z), b_2(Y_k = z), b_3(Y_k = z) output by the 3 stages and the true confidence b_*(Y_k = z); these intermediate losses prevent the vanishing-gradient problem during network training. The total error produced by KEN is computed according to equation (8):

f = Σ_{t=1}^{3} Σ_{j=1}^{14} Σ_{z∈Z} || b_t^j(z) − b_*^j(z) ||_2^2    (8)

where b_t^j(z) is the predicted confidence of the j-th key node of the human skeleton at position z, and b_*^j(z) is the true confidence of the j-th key node of the human skeleton.
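A sketch of this intermediate-supervision loss, assuming each stage returns belief maps of shape (batch, 14, H, W) and the ground-truth confidence maps share that shape:

```python
import torch

def ken_loss(stage_beliefs, target_beliefs):
    """Equation (8): sum over the 3 stages of the squared Euclidean distance between
    the predicted belief maps b_t and the true confidence maps b_*.

    stage_beliefs: list of 3 tensors, each (batch, 14, H, W).
    target_beliefs: tensor (batch, 14, H, W) of ground-truth confidence maps.
    """
    loss = torch.zeros((), device=target_beliefs.device)
    for b_t in stage_beliefs:                                  # t = 1, 2, 3
        loss = loss + torch.sum((b_t - target_beliefs) ** 2)   # sum over j and z
    return loss
```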
The KEN network is trained using the public human key node data set released by AI Challenger as training samples. During training of the human skeleton feature extraction network KEN, the batch size is 15; gradient descent uses the Adam optimizer with a learning rate of 0.0008 and an exponential decay rate of 0.8 every 20000 steps.
2) Design and implementation of the gesture part contour feature extraction network GPEN:
Because the labelled data in the data set are relatively scarce, training the SSD network directly easily causes overfitting. To mitigate overfitting and reduce the number of network model parameters, the invention replaces the feature extraction network VGGNet in SSD with MobileNet, which has fewer parameters, and thereby constructs the gesture part contour feature extraction network GPEN. FIG. 4 shows the network structure of GPEN.
In FIG. 4, the 0th convolutional layer of GPEN (Conv0) uses a conventional convolution kernel of size 3 × 3 × 32. The image feature extraction part of GPEN, i.e. the 1st to 13th convolutional layers (Conv1-Conv13), is built by stacking depth-separable convolutions, each group consisting of one single-depth (depthwise) convolution kernel and one single-point (pointwise) convolution kernel; the single-depth kernels of Conv1-Conv13 have sizes 3 × 3 × 32, 3 × 3 × 64, 3 × 3 × 128, 3 × 3 × 256, 3 × 3 × 512 and 3 × 3 × 1024, and the single-point kernels have sizes 1 × 1 × 32 × 64, 1 × 1 × 64 × 128, 1 × 1 × 128 × 128, 1 × 1 × 128 × 256, 1 × 1 × 256 × 512, 1 × 1 × 512 × 512, 1 × 1 × 512 × 1024 and 1 × 1 × 1024 × 1024. The subsequent 8 convolutional layers Conv14_1 to Conv17_2 have kernel sizes, in order, 1 × 1 × 1024 × 256, 3 × 3 × 256 × 512, 1 × 1 × 512 × 128, 3 × 3 × 128 × 256, 1 × 1 × 256 × 64 and 3 × 3 × 64 × 128.
The depth-separable convolutions in Conv1-Conv13 separate the channel correlation from the spatial correlation and replace the conventional convolution kernel with a depth-separable one, which greatly reduces the number of parameters in the network; FIG. 5 shows the complete convolution process. Here M is the number of image input channels, OC the number of output channels, K × K the size of the convolution kernel, D_F × D_F the size of the input feature map and D_L × D_L the size of the output feature map. The ratio of the parameters of a depth-separable convolution kernel to those of a conventional convolution kernel is computed by equation (9):

(K × K × M × D_L × D_L + M × OC × D_L × D_L) / (K × K × M × OC × D_L × D_L) = 1/OC + 1/K²    (9)

Because OC is large when the convolutional network extracts image features, the term 1/OC can be ignored. If an ordinary convolution kernel of size 3 × 3 is used, the term 1/K² has the value 1/9. It can be seen that the depth-separable convolution greatly reduces the number of feature parameters and thereby helps prevent the network from overfitting.
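A PyTorch sketch of one depth-separable block (single-depth convolution followed by a single-point convolution) is shown below; the channel counts are example values, and the parameter comparison at the end mirrors equation (9):

```python
import torch.nn as nn

def separable_block(in_ch: int, out_ch: int, stride: int = 1) -> nn.Sequential:
    """Depthwise 3x3 convolution (one filter per input channel) followed by a
    pointwise 1x1 convolution that mixes channels, as in MobileNet."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

# Parameter comparison for equation (9) with example values in_ch=128, out_ch=256, K=3.
sep = sum(p.numel() for p in separable_block(128, 256).parameters() if p.dim() == 4)
std = 3 * 3 * 128 * 256                       # weights of a standard 3x3 convolution
print(sep / std)                              # roughly 1/256 + 1/9 ≈ 0.115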
The loss function for GPEN network training consists of a classification loss and a localization loss, as shown in equation (10):

L(x, c, l, g) = (1/N) ( L_conf(x, c) + α × L_Loc(x, l, g) )    (10)

where L_conf is the classification loss function, for which the invention uses the Softmax loss; L_Loc is the localization loss function of the prediction box, for which the invention uses the smooth L1 loss; α is the weight coefficient of the localization loss, set to 1; N is the number of samples input to GPEN; x is the category matching information of the current prediction box; g is the ground-truth detection box; and the remaining variables are consistent with the GPEN definitions in the disclosure: l denotes the position information of a part contour prediction box (l ∈ L), consisting of the coordinates of its centre point together with its width and height, and c denotes the confidence of classifying the object contour contained in the prediction box into a different part contour category (c ∈ C).
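A simplified sketch of this combined loss follows, computing the Softmax classification term over all anchors and the smooth-L1 localization term over the matched (positive) anchors only; hard-negative mining and box encoding are omitted, and the tensor shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def gpen_loss(cls_logits, box_preds, cls_targets, box_targets, alpha: float = 1.0):
    """Equation (10): (L_conf + alpha * L_loc) / N.

    cls_logits:  (num_anchors, num_classes) predicted class scores c.
    box_preds:   (num_anchors, 4) predicted box offsets l.
    cls_targets: (num_anchors,) class index per anchor, 0 = background.
    box_targets: (num_anchors, 4) encoded ground-truth boxes g.
    """
    positive = cls_targets > 0
    n = positive.sum().clamp(min=1).float()                               # matched anchors N
    l_conf = F.cross_entropy(cls_logits, cls_targets, reduction="sum")    # Softmax loss
    l_loc = F.smooth_l1_loss(box_preds[positive], box_targets[positive], reduction="sum")
    return (l_conf + alpha * l_loc) / n
```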
before network training, GPEN needs to optimize an SSD anchoring box according to the contour characteristics of hands and heads in human gestures. The scatter plot of the left-right hand and head contour scale proportion of the public traffic police gesture data set labeled video sample adopted by the invention is shown in fig. 6. In fig. 6, the abscissa represents the ratio of the width of the part outline marking frame to the width of the entire image; the ordinate represents the proportion of the height of the part outline marking frame to the height of the whole image. As can be seen from fig. 6, the ratio of the height of the component labeling frame to the height of the original image is less than 0.25, the ratio of the width of the component labeling frame to the width of the original image is less than 0.20, and the normalized dimension of the component labeling frame is between 0.05 and 0.25. In order to train GPEN, the normalized scale value of the anchor box is between 0.05 and 0.3. Both GPEN and original SSD networks contain 6 feature layers including anchor boxes, and the normalized scale of the anchor boxes on each feature layer is shown in the following table:
Figure BDA0002645011500000101
in the GPEN network training part, the image feature extractor of the gesture component outline feature extraction network is changed from VGGNet to MobileNet during network model design, so that a MobileNet pre-training model provided by Google is directly loaded to GPEN. The GPEN after the pre-training model is changed takes a human body gesture video frame data set as a sample for training, and the batch value is 24. In the training process, the loss function value is continuously reduced through random gradient descent and a back propagation mechanism, so that the position of the anchor frame approaches to the position of a real frame, and the classification confidence coefficient is improved. After 120000 steps of accumulative training, network convergence and system accuracy do not change any more, the model is stored for recognizing and extracting human body gesture part contour characteristics, and the relative length and angle characteristic data of the skeleton in the human body gesture calculated by combining the KEN network are combined to obtain the spatial context characteristics.
(3) Temporal feature extraction network and dynamic gesture classification.
According to the key nodes output by KEN and the association relations between them, the relative length of each skeleton segment in the human skeleton and its angle with respect to the gravitational acceleration can be respectively calculated; combined with the left-hand, right-hand and head contour categories output by GPEN, this generates the human gesture spatial context feature F_τ at time τ.
After the human gesture spatial context feature F_τ is obtained, the LSTM network is used to extract the temporal features of the dynamic gesture. FIG. 7 shows the architecture of the LSTM network used in the invention. In FIG. 7, e_{τ−1}, h_{τ−1} and F_τ are the inputs to the LSTM network, where F_τ is the concatenated (concat) feature value of the relative lengths of the skeleton segments at time τ, their angles with respect to the gravitational acceleration, and the left-hand, right-hand and head contour categories. In addition, at the initial time τ the system randomly generates initial values e_0 and h_0, where h_0 serves as the temporal feature of the dynamic gesture and e_0 as the LSTM network memory; e_τ and h_τ are the outputs of the network and serve as its inputs at the next time step when τ > 1. "sigmoid", "tanh" and "softmax" denote activation functions, and P_τ is the probability of the dynamic gesture category obtained by applying the activation function "softmax" to the network output h_τ at time τ. During network training, the cross-entropy function is used to calculate the network loss, and a truncated backpropagation algorithm is adopted to avoid the vanishing-gradient problem.
The invention uses Xavier initialization for the neurons in the LSTM network and trains the LSTM network with a truncated backpropagation algorithm. During training, the human gesture video data set is randomly cut into short clips 90 seconds long, and 128 clips are assembled into one batch. The learning rate of the LSTM is 0.0004 and the gradient descent algorithm is the Adam optimizer. LSTM network training is stopped after 50000 accumulated training steps.
After the GRSCTFF dynamic gesture recognizer is built, it is trained on public data sets. The human skeleton key node network KEN uses the public human key node data set released by AI Challenger as training samples, and the gesture part contour feature extraction network GPEN is trained on the training set of a public traffic-police gesture data set.
Finally, GRSCTFF is experimentally verified on the test set of the public traffic-police gesture data set. The accuracy of human gesture recognition is computed from the edit distance, i.e. the minimum number of edit operations required to convert the gesture sequence predicted by the model into the ground-truth labelled gesture sequence, and the accuracy is computed according to equation (11), where Accuracy denotes the accuracy, H is the total number of poses in the video, I is the total number of inserted poses, D is the total number of deleted poses and P is the total number of replaced poses:

Accuracy = (H − I − D − P) / H × 100%    (11)
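A one-line sketch of the reconstructed form of equation (11), returning a fraction rather than a percentage; H, I, D and P are the counts defined above:

```python
def gesture_accuracy(h: int, i: int, d: int, p: int) -> float:
    """Equation (11): accuracy from the edit operations needed to turn the
    predicted gesture sequence into the labelled one."""
    return (h - i - d - p) / h
```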
Experiments show that GRSCTFF can recognize human gestures quickly and accurately, with a system accuracy of 94.12%, and that it is highly robust to changes in lighting, background and human gesture position.

Claims (2)

1. A gesture recognition method fusing the body skeleton with the contours of the head and hand parts, characterized in that:
(1) on the basis of analyzing the spatial context features of human gestures, a dynamic gesture model based on human skeleton and part contour features is established;
when gesture interaction is used, a universal gesture model is established that fuses the contour features of the human skeleton, hand and head parts;
(2) a deep neural network is constructed using the convolutional pose machine and single-shot multibox detector techniques to extract the human gesture skeleton and part contour features, which are combined into the human spatial context features;
the gesture spatial context information consists of the gesture skeleton configuration and the gesture part contours; the gesture skeleton configuration comprises the relative length features of the human skeleton and its angle features relative to the direction of gravitational acceleration, and a human skeleton key node extraction network KEN comprising 3 stages is constructed:
setting Z as the set of all position coordinates (i, j) of the human skeleton in the image; denoting the position of each skeleton key node in the image by Y_k, the human skeleton containing 14 key nodes in total, hence Y_k ∈ {Y_1, …, Y_14}; KEN consists of a series of multi-class predictors g_t trained to predict the location of each key node in the same image under different receptive fields; specifically, g_t(·) is a classifier and the subscript t ∈ {1, …, T} indexes the classification stages, each stage having a different receptive field, where T is the last stage of the classifier; g_t(·) predicts the confidence that a point z in the image under that receptive field belongs to key node Y_k, where z ∈ Z; the confidence value is written b(Y_k = z), so b_T(Y_k = z) is the confidence that coordinate point z is the key node at stage T; all g_t(·) share the same objective function value, i.e. the true confidence; when t > 1, g_t(·) is a function that concatenates the feature value x_z extracted at image position z with the confidence predictions for each key node Y_k at time t−1; after T stages, the position with the highest confidence is the key node position, and argmax denotes the function that returns the coordinate point z at which the confidence of key node Y_k is maximal; namely:

Y_k = argmax_z b_T(Y_k = z), k ∈ {1, …, 14}    (1)

calculating the position of each key node in the human skeleton based on equation (1), and establishing a preliminary human skeleton configuration;
taking the height of the human head as a reference point and introducing a function φ_2(·) denoting the vector concatenation of the relative visible lengths of the skeleton segments contained in the human skeleton:

V_l = φ_2(V) = (|v_1| / |V_head|) ⊕ (|v_2| / |V_head|) ⊕ … ⊕ (|v_11| / |V_head|)    (2)

wherein 11 indicates that the human skeleton model contains 11 limb skeleton segments in total, v_i is the i-th skeleton segment of the human gesture, V_head is the head skeleton vector from the vertex to the centre of the neck, | · | denotes the vector modulus, i.e. the length of the head skeleton, and ⊕ denotes vector concatenation; the formula takes V_head as the reference and divides the length of each limb skeleton segment by the modulus of V_head to calculate the visible length of each segment relative to the head skeleton;
in addition, since the direction of gravitational acceleration is always perpendicular to the ground, in order to describe the direction of each skeleton segment in the human skeleton relative to the ground, the angle between each segment and the gravitational acceleration is introduced, and φ_3(·) denotes the vector concatenation of the angles between each skeleton segment and the gravity direction, as in equation (3):

V_a = φ_3(V) = (cos⟨v_1, d⟩, sin⟨v_1, d⟩) ⊕ … ⊕ (cos⟨v_11, d⟩, sin⟨v_11, d⟩)    (3)

describing the angle feature of each skeleton segment by the trigonometric function values of the angle between the segment and the gravitational acceleration direction; in equation (3), d denotes a unit vector pointing in the gravity direction; cos⟨v_i, d⟩ = (v_i · d) / (|v_i| |d|) is the cosine of the angle between each skeleton vector and the gravity direction, and sin⟨v_i, d⟩ is the corresponding sine value; through these steps the 2 spatial context features contained in the human skeleton are extracted, namely the relative visible lengths V_l of the skeleton segments and the angles V_a between the segments and the gravity direction; the shape feature of the human gesture skeleton is denoted bone = V_l ∪ V_a;
constructing the gesture part contour feature extraction network GPEN:
setting S as the set of part contour feature values (L, C) recognized from the image by GPEN, wherein L denotes the position information of a part contour prediction box, consisting of the coordinates of its centre point together with its width and height; C denotes the set of confidences for classifying the object contour contained in the prediction box into the different part contour categories; c_i denotes the confidence that a part contour belongs to the i-th part contour class, i.e. c_i ∈ C;
for each part contour p there is s_p ∈ S with position information l_p and class confidence set C_p; supposing the category corresponding to the largest confidence value in C_p is m and M is the complete set of gesture part contour categories, the category of p is set to m, where m ∈ M, with confidence value c_m, where c_m ∈ C_p, and the feature value of s_p is then (l_m, c_m); by analogy, the feature value set of all part contours in an image is S = (L_m, C_m); according to a preset confidence threshold c_th, set to 0.5, below which the object is not considered a part contour, the elements whose c_m is lower than c_th are removed from S, and the elements of S are sorted in descending order of confidence value to form the final part contour set G; the following 3 steps are repeated:
1) taking the part with the highest confidence value c_m in G and calculating, with each of the other parts in G, equation (4), where J(l_m, l_other) denotes the degree of overlap between this part's contour and the other part's contour, l_m is the position feature of this part's contour and l_other is the position feature of the other part's contour;

J(l_m, l_other) = |l_m ∩ l_other| / |l_m ∪ l_other|    (4)

2) the overlap threshold for recognizing the same part contour is J_th, set to 0.5, i.e. an overlap coverage above 50% is regarded as the same part contour, so when J(l_m, l_other) is higher than J_th, the part feature s_other corresponding to l_other is deleted from G;
3) when the above operations on the sorted part set G are complete, the part feature s_m corresponding to l_m is deleted from G and the (l_m, c_m) value corresponding to s_m is output; the category to which m belongs determines whether the part contour is a left-hand contour feature S_left (or a right-hand contour feature S_right, or a head contour feature S_head);
repeating steps 1) to 3) until the set G is empty, finally obtaining the left-hand part contour feature S_left, the right-hand part contour feature S_right and the head part contour feature S_head; on this basis, equation (5) splices the human skeleton feature bone of the gesture with the left-hand part contour feature S_left, the right-hand part contour feature S_right and the head part contour feature S_head to form the spatial context feature F of the gesture; namely:

F = φ_4(bone, S_left, S_right, S_head)    (5)
(3) a long short-term memory network is introduced to extract the temporal features of the skeleton and of the left-hand, right-hand and head contours in dynamic human gestures, the human spatial context features are fused, and the gestures are further classified and recognized, completing the construction of GRSCTFF;
in dynamic gesture recognition, the gesture category is related not only to the current gesture features but also to the previous gesture features; f_cls is the gesture classification function and classification denotes the recognized human gesture category; F_0 denotes the spatial context feature of the body at time 0, F_1 the spatial context feature at time 1, and F_τ the human spatial context feature at time τ, so that the gesture category at the current time is obtained according to equation (6);

classification = f_cls(F_0, F_1, …, F_τ)    (6)
2. The method of claim 1, wherein:
the human skeleton is abstracted as 14 key nodes and the lines connecting them; the coordinate set of these key nodes is Y, where Y_1 denotes human key node No. 1 and the remaining key nodes are numbered analogously, so Y = (Y_1, Y_2, …, Y_14); V denotes the set of connection dependencies existing between adjacent key nodes in Y, i.e. the human limb skeleton, which consists of 3 parts: the head skeleton V_head, the upper-body skeleton V_upper and the lower-body skeleton V_lower; namely:

V = V_head ∪ V_upper ∪ V_lower    (7)

if v is a key node connection (i.e. v ∈ V) whose starting and ending key nodes are Y_a and Y_b respectively, then the vector v from Y_a to Y_b denotes a skeleton vector contained in the human skeleton; similarly to the key node classification, the human gesture parts mainly comprise the head and the hands, where the hands comprise a left hand and a right hand, and the human gesture model is completely described by fusing the gesture part contours with the human skeleton posture;
(2) the design of the spatial context feature extraction module is realized by constructing a deep neural network using the convolutional pose machine and single-shot multibox detector techniques to extract the human gesture skeleton and part contour features, and combining them into the human spatial context features;
the spatial context feature extraction module comprises two parts: the design and implementation of the human skeleton key node recognition network KEN, and the design and implementation of the gesture part contour feature extraction network GPEN, as follows:
1) design and implementation of the human skeleton key node recognition network KEN:
15 heat maps are output when detecting human poses, wherein 14 heat maps correspond to the corresponding human key nodes and the remaining 1 is a background heat map; the association relations between the skeleton key nodes are supplemented at the output end; meanwhile, to support real-time gesture recognition, the CPM depth is pruned and a human key node extraction network KEN comprising 3 stages is constructed;
C denotes a convolutional layer, K the convolution kernel size, OC the number of output channels and t the stage; KEN adopts the first 10 layers of the VGG-19 network as the image feature extraction network for the input image, with convolution kernel sizes 3 × 3 × 64, 3 × 3 × 64, 3 × 3 × 128, 3 × 3 × 128, 3 × 3 × 512, 3 × 3 × 512, 3 × 3 × 256 and 3 × 3 × 256; in addition, the convolutional layers at the 2nd, 4th and 6th positions are each followed by max pooling with a 2 × 2 kernel and stride 2, and this processing finally yields the image feature x_z; KEN prunes the CPM network depth and implements a classifier comprising 3 stages, where Z is the set of all position coordinates of the human skeleton in the image; when t = 1, stage 1 takes the image feature x_z as input and implements the stage-1 classifier g_1(·); g_1(·) comprises 2 branches with identical convolutional layer parameters, each containing 5 convolutional layers whose kernel sizes are, in order, 3 × 3 × 128, 3 × 3 × 128, 1 × 1 × 512 and 1 × 1 × 8; the first branch outputs the joint point confidence set b_1(Y_k = z), z ∈ Z, and the second branch outputs the skeleton set L1; when t = 2, stage 2 takes the image feature x_z, the joint point confidence set b_1(Y_k = z) output by stage 1 and the skeleton set L1 as inputs and implements the stage-2 classifier g_2(·); g_2(·) comprises 2 branches with identical convolutional layer parameters, each containing 7 convolutional layers whose kernel sizes are, in order, 7 × 7 × 128, 7 × 7 × 128, 7 × 7 × 128, 7 × 7 × 128 and 1 × 1 × 8; the first branch outputs the joint point confidence set b_2(Y_k = z) and the second branch outputs the skeleton set L2; when t = 3, stage 3 takes the image feature x_z, the joint point confidence set b_2(Y_k = z) output by stage 2 and the skeleton set L2 as inputs and implements the stage-3 classifier g_3(·), whose convolutional layer structure is identical to that of g_2(·); the first branch outputs the joint point confidence set b_3(Y_k = z) and the second branch outputs the skeleton set L3;
KEN contains 3 cost functions, which respectively calculate the Euclidean distance between the joint point confidence sets b_1(Y_k = z), b_2(Y_k = z), b_3(Y_k = z) output by the 3 stages and the true confidence b_*(Y_k = z); the total error of the system produced by KEN is calculated according to equation (8);

f = Σ_{t=1}^{3} Σ_{j=1}^{14} Σ_{z∈Z} || b_t^j(z) − b_*^j(z) ||_2^2    (8)

where b_t^j(z) is the predicted confidence of the j-th key node of the human skeleton at position z, and b_*^j(z) is the true confidence of the j-th key node of the human skeleton;
in the training of the human skeleton feature extraction network KEN, the batch size is 15; gradient descent uses the Adam optimizer with a learning rate of 0.0008 and an exponential decay rate of 0.8 every 20000 steps;
2) design and realization of a gesture part outline feature extraction network GPEN:
constructing a gesture component outline feature extraction network GPEN;
layer 0 convolutional layer Conv0 of GPEN uses convolutional kernels, the size of which is 3 × 3 × 32; the image feature extraction network section in GPEN, i.e., the 1 st to 13 th convolutional layers, i.e., Conv1-Conv13, is constructed based on a stacking technique of depth-separable convolutions, each set of depth-separable convolutions including one single-depth convolution kernel and one single-point convolution kernel, the sizes of the single-depth convolution kernels of Conv1-Conv13 are respectively 3 × 3 × 32, 3 × 3 × 64, 3 × 3 × 128, 3 × 3 × 256, 3 × 3 × 512, 3 × 3 × 1024, and the sizes of the single-point convolution kernels of Conv1-Conv13 are respectively 1 × 1 × 32 × 64, 1 × 64 × 128, 1 × 1 × 128 × 128, 1 × 1 × 128 × 256, 1 × 1 × 512, 512 × 1 × 512, and 3 × 512, 3 × 512, 1 × 1 × 512 × 512, 1 × 1 × 512 × 1024, 1 × 1 × 1024 × 1024; then the sizes of convolution kernels of 8 convolutional layers in total, namely Conv14_1 to Conv17_2, are respectively 1 × 1 × 1024 × 256, 3 × 3 × 256 × 512, 1 × 1 × 512 × 128, 3 × 3 × 128 × 256, 1 × 1 × 256 × 64 and 3 × 3 × 64 × 128 in sequence;
m is the number of image input channels, OC is the number of output channels, K represents the size of the convolution kernel, K multiplied by K represents the size of the convolution kernel, DF×DFSize of input feature graph, DL×DLRepresenting the size of the output feature map; the characteristic parameter ratio of the depth separable convolution kernel to the traditional convolution kernel is calculated by a formula (9);
\frac{K \cdot K \cdot M \cdot D_L \cdot D_L + M \cdot OC \cdot D_L \cdot D_L}{K \cdot K \cdot M \cdot OC \cdot D_L \cdot D_L} = \frac{1}{OC} + \frac{1}{K^2}    (9)
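As a quick arithmetic check of formula (9): the M and D_L terms cancel, leaving 1/OC + 1/K². For example, with K = 3 and OC = 256:

```python
# Ratio of depthwise-separable to standard convolution cost, per Eq. (9):
# (K*K*M*D*D + M*OC*D*D) / (K*K*M*OC*D*D) = 1/OC + 1/K**2  (M and D cancel).
K, OC = 3, 256
ratio = 1 / OC + 1 / K**2
print(ratio)  # ~0.115, i.e. roughly an 8.7x reduction
```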
The loss function for GPEN network training is composed of a classification loss and a localization loss, as shown in formula (10);
L(x, c, l, g) = \frac{1}{N} \left( L_{conf}(x, c) + \alpha L_{loc}(x, l, g) \right)    (10)
wherein L_{conf} is the classification loss function, for which the Softmax loss is used; L_{loc} is the localization loss function of the prediction box, for which the smooth L1 loss is used; α is the weight coefficient of the localization loss and is set to 1; N is the number of samples input to GPEN; x is the category matching information of the current prediction box; g is the ground-truth detection box; l ∈ L represents the position information of a part-contour prediction box, consisting of the coordinates of the prediction box's center point together with its width and height; and c ∈ C represents the confidence with which the object contour contained in the prediction box is classified into the different part-contour categories;
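A minimal sketch of formula (10) with a Softmax (cross-entropy) classification term and a smooth-L1 localization term; anchor matching and box encoding are omitted, and the tensor layout is an assumption.

```python
import torch
import torch.nn.functional as F

def gpen_loss(cls_logits, cls_targets, loc_preds, loc_targets, alpha=1.0):
    """cls_logits: [N, num_classes]; cls_targets: [N] class indices;
    loc_preds / loc_targets: [N, 4] encoded box offsets (cx, cy, w, h)."""
    n = cls_logits.shape[0]
    l_conf = F.cross_entropy(cls_logits, cls_targets, reduction='sum')  # Softmax loss
    l_loc = F.smooth_l1_loss(loc_preds, loc_targets, reduction='sum')   # smooth L1 loss
    return (l_conf + alpha * l_loc) / n
```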
GPEN contains 6 feature layers that carry anchor boxes; the normalized scales of the anchor boxes on each feature layer are as follows:
[Normalized anchor-box scales for the 6 feature layers: given as an image in the original claim and not reproduced in the text extraction.]
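The actual scale values appear only as an image in the source. Purely as an illustration of how such normalized scales are commonly chosen (the SSD convention, not necessarily the patent's values), they can be spaced linearly between a minimum and maximum scale across the 6 feature layers:

```python
# Assumed SSD-style anchor scales (NOT the patent's actual values):
# s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1), for k = 1..m feature layers.
s_min, s_max, m = 0.2, 0.9, 6
scales = [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]
print([round(s, 2) for s in scales])  # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```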
Because the image feature extractor of the gesture part contour feature extraction network was changed from VGGNet to MobileNet when the network model was designed, the GPEN training stage directly loads the MobileNet pre-training model provided by Google into GPEN; GPEN, with the pre-trained model loaded, is then trained on the human body gesture video-frame dataset with a batch size of 24; during training, the loss function value is continuously reduced through stochastic gradient descent and the back-propagation mechanism, so that the positions of the anchor boxes approach the positions of the real boxes and the classification confidence improves; training continues until the network converges and the system accuracy no longer changes, after which the model is saved for recognizing and extracting the contour features of the human gesture parts, which are combined with the relative length and angle features of the skeleton segments in the human gesture calculated by the KEN network to obtain the spatial context features;
(3) Design of the time-sequence feature extraction network and dynamic gesture classification:
According to the key nodes output by the KEN and the association relations among the nodes, the relative length of each skeleton segment in the human skeleton and its included angle with the gravitational acceleration are calculated, and these are combined with the left-hand contour, right-hand contour and head contour categories output by the GPEN to generate the human gesture spatial context feature F_τ at time τ.
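A sketch of assembling F_τ from the KEN joints and the GPEN contour categories. The bone list, the reference bone used for length normalization, the gravity direction in image coordinates, and the one-hot encoding of contour categories are all assumptions.

```python
import numpy as np

def spatial_context_feature(joints, bones, ref_bone, contours, n_contour_classes=10):
    """joints: dict name -> (x, y) image coordinates from KEN.
    bones: list of (joint_a, joint_b) pairs defining skeleton segments.
    ref_bone: pair used to normalise lengths (e.g. the torso) -- an assumption.
    contours: (left_hand_cls, right_hand_cls, head_cls) integer labels from GPEN."""
    g = np.array([0.0, 1.0])                       # gravity direction in image coords (assumption)
    ref = np.linalg.norm(np.subtract(joints[ref_bone[1]], joints[ref_bone[0]])) + 1e-6
    feats = []
    for a, b in bones:
        v = np.asarray(np.subtract(joints[b], joints[a]), dtype=np.float64)
        length = np.linalg.norm(v)
        feats.append(length / ref)                 # relative bone length
        cos_angle = float(np.dot(v, g) / (length + 1e-6))
        feats.append(np.arccos(np.clip(cos_angle, -1.0, 1.0)))  # angle to gravity
    for c in contours:                             # one-hot encode the 3 contour categories
        one_hot = np.zeros(n_contour_classes)
        one_hot[c] = 1.0
        feats.extend(one_hot)
    return np.asarray(feats, dtype=np.float32)     # F_tau
```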
After the human gesture spatial context feature F_τ is obtained, an LSTM network is used to extract the time-sequence features of the dynamic gesture; e_{τ-1}, h_{τ-1} and F_τ are the inputs of the LSTM network, where F_τ is the feature value synthesized at time τ from the relative length of each skeleton segment in the human skeleton, the included angle between each segment and the gravitational acceleration, and the left-hand, right-hand and head contour categories. When τ is the initial time, the system randomly generates initial values e_0 and h_0, with h_0 serving as the time-sequence feature of the dynamic gesture and e_0 serving as the memory of the LSTM network; e_τ and h_τ are the outputs of the network and, for τ > 1, serve as the inputs of the LSTM network at the next time step. "sigmoid", "tanh" and "softmax" denote activation functions, and P_τ represents the probability of each dynamic gesture category obtained by passing the network output h_τ at time τ through the activation function "softmax". When the network is trained, the cross-entropy function is used to calculate the network loss, and a truncated back-propagation algorithm is adopted to avoid the vanishing-gradient problem during training;
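The recurrence e_{τ-1}, h_{τ-1}, F_τ → e_τ, h_τ, P_τ can be sketched with a standard LSTM cell, treating e as the cell memory and h as the hidden state; the layer sizes and the class count are assumptions.

```python
import torch
import torch.nn as nn

class GestureLSTM(nn.Module):
    """One step of the time-sequence network: F_tau plus the previous memory e
    and hidden state h produce (e_tau, h_tau) and class probabilities P_tau."""
    def __init__(self, feat_dim=64, hidden_dim=128, n_classes=8):
        super().__init__()
        self.cell = nn.LSTMCell(feat_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, n_classes)

    def forward(self, f_tau, h_prev, e_prev):
        h_tau, e_tau = self.cell(f_tau, (h_prev, e_prev))   # e plays the role of the cell memory
        p_tau = torch.softmax(self.classifier(h_tau), dim=-1)
        return h_tau, e_tau, p_tau

# At the initial time the states are randomly generated, as described in the text:
model = GestureLSTM()
h0, e0 = torch.randn(1, 128), torch.randn(1, 128)
f1 = torch.randn(1, 64)
h1, e1, p1 = model(f1, h0, e0)
```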
The neurons in the LSTM network are initialized with Xavier initialization, and the LSTM network is trained with the truncated back-propagation algorithm. During training, the human gesture video dataset is randomly cut into short clips 90 seconds long, and 128 such clips are assembled into one batch; the learning rate of the LSTM is 0.0004, the Adam optimizer is adopted for the gradient descent algorithm, and LSTM network training stops after 50,000 accumulated training steps;
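A rough training-loop sketch matching the stated settings (Xavier initialization, Adam at 0.0004, truncated back-propagation). The truncation chunk length and the tensor shapes are assumptions, and `GestureLSTM` refers to the sketch above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = GestureLSTM()                       # from the sketch above
for p in model.parameters():                # Xavier initialisation of the weight matrices
    if p.dim() >= 2:
        nn.init.xavier_uniform_(p)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0004)

def train_clip(frames, labels, chunk=30):
    """frames: [T, B, 64] features for one batch of clips; labels: [T, B].
    `chunk` is the truncation length for truncated BPTT (an assumption)."""
    h = torch.zeros(frames.size(1), 128)
    e = torch.zeros(frames.size(1), 128)
    for start in range(0, frames.size(0), chunk):
        optimizer.zero_grad()
        h, e = h.detach(), e.detach()       # cut the gradient between chunks
        loss = torch.zeros(())
        for f_tau, y_tau in zip(frames[start:start + chunk], labels[start:start + chunk]):
            h, e, p_tau = model(f_tau, h, e)
            loss = loss + F.nll_loss(torch.log(p_tau + 1e-9), y_tau)  # cross-entropy loss
        loss.backward()
        optimizer.step()
```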
After the GRSCTFF dynamic gesture recognizer is built, it is trained on a public dataset.
CN202010851927.1A 2020-08-21 2020-08-21 Gesture recognition method for fusing body skeleton and head and hand part profiles Pending CN112183198A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010851927.1A CN112183198A (en) 2020-08-21 2020-08-21 Gesture recognition method for fusing body skeleton and head and hand part profiles

Publications (1)

Publication Number Publication Date
CN112183198A true CN112183198A (en) 2021-01-05

Family

ID=73925010

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010851927.1A Pending CN112183198A (en) 2020-08-21 2020-08-21 Gesture recognition method for fusing body skeleton and head and hand part profiles

Country Status (1)

Country Link
CN (1) CN112183198A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287844A (en) * 2019-06-19 2019-09-27 北京工业大学 Traffic police's gesture identification method based on convolution posture machine and long memory network in short-term
CN110837778A (en) * 2019-10-12 2020-02-25 南京信息工程大学 Traffic police command gesture recognition method based on skeleton joint point sequence

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HE Jian et al., "Visual gesture recognition based on long short-term memory and deep neural network", Journal of Graphics (《图学学报》), 30 June 2020 (2020-06-30), pages 372-381 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022227768A1 (en) * 2021-04-28 2022-11-03 北京百度网讯科技有限公司 Dynamic gesture recognition method and apparatus, and device and storage medium
CN113378641A (en) * 2021-05-12 2021-09-10 北京工业大学 Gesture recognition method based on deep neural network and attention mechanism
CN113378641B (en) * 2021-05-12 2024-04-09 北京工业大学 Gesture recognition method based on deep neural network and attention mechanism
CN113269075A (en) * 2021-05-19 2021-08-17 广州繁星互娱信息科技有限公司 Gesture track recognition method and device, storage medium and electronic equipment
CN113269089A (en) * 2021-05-25 2021-08-17 上海人工智能研究院有限公司 Real-time gesture recognition method and system based on deep learning
CN113269089B (en) * 2021-05-25 2023-07-18 上海人工智能研究院有限公司 Real-time gesture recognition method and system based on deep learning
CN113660527A (en) * 2021-07-19 2021-11-16 广州紫为云科技有限公司 Real-time interactive somatosensory method, system and medium based on edge calculation
CN113840177A (en) * 2021-09-22 2021-12-24 广州博冠信息科技有限公司 Live broadcast interaction method and device, storage medium and electronic equipment
CN113840177B (en) * 2021-09-22 2024-04-30 广州博冠信息科技有限公司 Live interaction method and device, storage medium and electronic equipment
CN116152519A (en) * 2023-04-17 2023-05-23 深圳金三立视频科技股份有限公司 Feature extraction method and device based on image
CN116152519B (en) * 2023-04-17 2023-08-15 深圳金三立视频科技股份有限公司 Feature extraction method and device based on image

Similar Documents

Publication Publication Date Title
Bhattacharya et al. Step: Spatial temporal graph convolutional networks for emotion perception from gaits
CN112183198A (en) Gesture recognition method for fusing body skeleton and head and hand part profiles
CN110427867B (en) Facial expression recognition method and system based on residual attention mechanism
CN109919031B (en) Human behavior recognition method based on deep neural network
Wang et al. Depth pooling based large-scale 3-d action recognition with convolutional neural networks
Özyer et al. Human action recognition approaches with video datasets—A survey
CN108256421A (en) A kind of dynamic gesture sequence real-time identification method, system and device
CN110287844B (en) Traffic police gesture recognition method based on convolution gesture machine and long-and-short-term memory network
Yang et al. Extraction of 2d motion trajectories and its application to hand gesture recognition
CN112800903B (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
Asif et al. A multi-modal, discriminative and spatially invariant CNN for RGB-D object labeling
CN106778796B (en) Human body action recognition method and system based on hybrid cooperative training
Arif et al. Automated body parts estimation and detection using salient maps and Gaussian matrix model
Shbib et al. Facial expression analysis using active shape model
CN106909938B (en) Visual angle independence behavior identification method based on deep learning network
Shen et al. Emotion recognition based on multi-view body gestures
Rao et al. Sign Language Recognition System Simulated for Video Captured with Smart Phone Front Camera.
CN109086659B (en) Human behavior recognition method and device based on multi-channel feature fusion
CN111680550B (en) Emotion information identification method and device, storage medium and computer equipment
Lee et al. 3-D human behavior understanding using generalized TS-LSTM networks
Rao et al. Selfie sign language recognition with multiple features on adaboost multilabel multiclass classifier
Kumar et al. 3D sign language recognition using spatio temporal graph kernels
CN112906520A (en) Gesture coding-based action recognition method and device
Yang et al. TS-YOLO: An efficient YOLO network for multi-scale object detection
Lima et al. Simple and efficient pose-based gait recognition method for challenging environments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination