CN112633209B - Human action recognition method based on graph convolution neural network - Google Patents
- Publication number
- CN112633209B (application CN202011600579.7A)
- Authority
- CN
- China
- Prior art keywords
- graph
- network
- neural network
- video
- human
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06V40/23—Recognition of whole body movements, e.g. for sport training
- G06F18/2415—Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
- G06V20/42—Higher-level, semantic clustering, classification or understanding of video scenes of sport video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Abstract
The invention discloses a human action recognition method based on a graph convolution neural network, which comprises the steps of: preparing and labeling human action video data, marking video labels according to the different kinds of actions; extracting bone keypoint features from the human action video data using the OpenPose pose estimation algorithm, calculating the change speed of the bone keypoints between adjacent frames through a skeleton-point main-stream network, and performing feature splicing; screening the bone keypoints, calculating the included angles of the screened bone keypoints through an angle branch network, and performing feature splicing; transmitting the spliced data to a graph neural network; extending the graph convolution from the spatial domain to the temporal domain; using a cross-attention model to enhance the performance of the network; and performing human action recognition. The invention can recognize and output the actions performed by humans in an input video, has good usability and robustness, and lays a certain foundation for the practical deployment of artificial intelligence technology in the field of action recognition.
Description
Technical Field
The invention relates to the technical field of computer vision, in particular to a human action recognition method based on a graph convolution neural network.
Background
Artificial intelligence technology has been spreading into various industries, and action recognition is a key technology behind a number of popular applications and demands; it has become one of the most closely watched directions in the field of computer vision. Examples include the detection of, and alarms for, abnormal human behavior in intelligent surveillance cameras, and the classification and retrieval of human behavior in video; high-image-quality games use motion capture technology to put the movements of professional actors into the game and give players a sense of immersion. It is expected that there will be more and more applications of action recognition technology in the future.
The computer vision field currently applies two main families of techniques to human action recognition: one is video-based methods using RGB and optical flow, and the other is methods based on human bone keypoints. The RGB and optical-flow methods can learn the task end to end, but extracting optical flow from video is a very heavy task; although various methods now exist to reduce this cost, optical flow has always remained a powerful feature for action recognition. Methods based on human bone keypoints are a newer approach to action recognition that became practical as pose estimation technology matured; compared with the traditional RGB and optical-flow methods, they can model human behavior more effectively, because the traditional methods cannot avoid the influence of background and lighting changes. On the other hand, they require feature extraction from the video using a pose estimation algorithm, which is one step more than the traditional methods. In addition, existing skeleton-based recognition methods simply use the raw bone keypoint data, whereas the information describing an action includes not only coordinates but also angles and their change speeds, which are likewise important elements of action feature description.
Thus, given the current state of the art and the complexity of actions themselves, there is a need for a human action recognition method with a deep-learning theoretical basis and richer descriptive elements for this task.
Disclosure of Invention
In view of the current state of the field and the complexity of actions themselves, the invention aims to provide a human action recognition method based on a graph convolution neural network.
In order to achieve the above purpose, the invention is implemented according to the following technical scheme:
a human action recognition method based on a graph convolution neural network comprises the following steps:
step 1: preparing human action video data, marking the video data, and marking the video labels according to different kinds of actions;
step 2: extracting bone keypoint features from the human action video data by using the OpenPose pose estimation algorithm, calculating the change speed of the bone keypoints between adjacent frames through the skeleton-point main-stream network, and performing feature splicing; screening the bone keypoints, calculating the included angles of the screened bone keypoints through the angle branch network, and performing feature splicing;
step 3: transmitting the spliced data to a graph neural network;
step 4: extending the graph convolution from the spatial domain to the temporal domain;
step 5: using a cross-attention model to enhance the performance of the network;
step 6: and constructing a graph convolution neural network consisting of nine space-time convolution modules, a global pooling layer and a Softmax layer, wherein the global pooling layer is used for summarizing node characteristics in a graph structure so as to upgrade the node-level characteristics into the graph-level characteristics, and then outputting the action numbers of people in a human action video through the Softmax layer.
Further, the step 2 specifically includes:
step 2.1: firstly, cutting videos to ensure that human beings in each video are positioned in the center of the video;
step 2.2: using the OpenPose pose estimation algorithm to extract human bone keypoints, take 15 equally spaced sampling points S = (T1, T2, T3, ..., T15) from the video and save the bone keypoint data at each point; 18 bone keypoints are extracted each time, respectively representing 18 parts of the human body; setting the length of a single video frame as L and the video width as W, normalize the extracted bone keypoint coordinates, using Tn to represent the bone keypoint data of the nth frame, where the normalized Tn is:
Tn = ((x1n/L, y1n/W), (x2n/L, y2n/W), ..., (x18n/L, y18n/W));
wherein xn is the abscissa of the nth bone keypoint, yn is the ordinate of the nth bone keypoint, and Tn is the normalized bone keypoint coordinates of the nth frame;
step 2.3: calculating the change speed of the keypoints between adjacent frames, the speed Vn being:
Vn = ((x1n - x1n-1, y1n - y1n-1), (x2n - x2n-1, y2n - y2n-1), ..., (x18n - x18n-1, y18n - y18n-1));
wherein the specific meaning of x and y is the same as in step 2.2; after the speed V is obtained, feature splicing is performed, and the total spliced feature Dn is:
Dn = Concat(Tn, Tn', Vn);
wherein Tn and Tn' respectively represent the normalized bone keypoint coordinates obtained from the side view and the front view at time n, and the Concat function represents splicing of the variables in brackets;
step 2.4: screening the bone keypoints extracted by OpenPose, keeping the left knee, right knee, left waist, right waist, left shoulder, right shoulder, left elbow and right elbow;
step 2.5: calculating the included angle at each retained joint:
(1) Knee:
(2) Waist:
(3) Shoulder:
(4) Elbow:
further, the step 3 specifically includes:
step 3.1: the default human skeleton structure identified by the OpenPose pose estimation algorithm is used as the basic connection of the graph neural network, and the adjacency matrix of the graph neural network structure is set as Ak, the adjacency matrix of the k-th layer, an N x N two-dimensional matrix where N equals 18, representing the 18 bone keypoints; the A(n1, n2) position represents the connection state between positions n1 and n2, a value of 1 representing connected and a value of 0 representing disconnected;
step 3.2: setting the adjacency matrix of the graph neural network structure as B k It represents the action structure adjacency matrix of the k-th layer; the matrix is also an N x N two-dimensional matrix having the same meaning as a except that the matrix has no fixed value, each element of which is a trainable parameter;
step 3.3: setting the adjacency matrix of the graph neural network structure as Ck, whose format is consistent with A and B, with Ck(n1, n2):
Ck(n1, n2) = Softmax(θ(fin(n1))^T · φ(fin(n2)));
this is a normalized Gaussian embedding that calculates the similarity between any two bone keypoints, θ and φ being two embedding functions, and T denoting matrix transposition, so that the final output dimension is unchanged; the Softmax operation maps the final values towards 0 and 1, indicating whether two keypoints are connected; the output formula of the final graph neural network is:
fout = Σ (k = 1..K) Wk · fin · (Ak + Bk + Ck);
wherein fin and fout respectively represent the input and output of the layer, K represents the total number of layers of the graph neural network, and W represents the convolution parameters.
Further, the step 4 specifically includes:
for a point nij, define i to represent the ith frame and j the jth bone keypoint; each time-domain convolution involves only the same bone keypoint, giving the formula:
fout(nij) = Σt wt · fin(n(i+t)j);
wherein w is the convolution parameter and fout(nij) is the output of the layer at point nij.
Further, the step 5 specifically includes:
step 5.1: cross attention enhances the expressive power of the main network flow through the feature map of the skeletal joint angle network branches, the formula of which is:
fattention = (1 + Attention) * fout;
step 5.2: the method for calculating the Attention comprises the following steps:
Attention = g(fself, fcross) * fout
wherein fself is the self-attention weight of the main-network output feature map and fcross is the cross-attention weight between the joint angles and the main-network data, the two being added to weight the main-network feature map; g represents transforming both to the dimension of fout and summing them; fcross is computed as follows:
wherein v(T, N, d) is the main-network feature map, N is the number of main-network skeleton nodes, and d represents the feature dimension of each node; a(T, k, m) is the joint-angle network feature map, k represents the number of joints in the bone joint angle data, and m is its dimension.
Further, the step 6 specifically includes:
step 6.1: the input first branches into a residual connection around the module; inside the module, the first operation is a spatial-domain graph convolution, followed by a batch normalization operation (BatchNormalization), a ReLU activation layer and a dropout layer with coefficient 0.5, then a time-domain graph convolution, again followed by batch normalization and ReLU activation layers; the overall network structure consists of nine spatio-temporal convolution modules, a global pooling layer and a Softmax layer.
Step 6.2: the global pooling layer in the network is used for summarizing node characteristics in the graph structure, so that the node-level characteristics are updated to the graph-level characteristics, and then the action numbers of people in the video are output through the Softmax layer.
Compared with the prior art, the human action recognition method based on the graph convolution neural network can recognize and output the actions performed by humans in an input video, has good usability and robustness, and lays a foundation for the practical deployment of artificial intelligence technology in the field of action recognition.
Drawings
Fig. 1 is a flowchart of a human motion recognition method based on a graph convolutional neural network according to the present invention.
Fig. 2 is a cross-attention network architecture.
Detailed Description
The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. The specific embodiments described herein are for purposes of illustration only and are not intended to limit the invention.
As shown in fig. 1 and 2, the present embodiment provides a human action recognition method based on a graph convolution neural network, which includes the following steps:
step 1: preparing human action video data and marking the video labels according to the different kinds of actions, with the label numbering starting from 0;
step 2.1: performing feature extraction and feature design on the basic data to serve as motion information features;
step 2.1.1: firstly, cutting videos to ensure that human beings in each video are positioned in the center of the video;
step 2.1.2: using the OpenPose pose estimation algorithm to extract human bone keypoints, we take 15 equally spaced sampling points S = (T1, T2, T3, ..., T15) from the video and save the bone keypoint data at each point. 18 bone keypoints are extracted each time, respectively representing 18 parts of the human body. Setting the length of a single video frame as L and the video width as W, the extracted bone keypoint coordinates are normalized; using Tn to represent the bone keypoint data of the nth frame, the normalized Tn is:
Tn = ((x1n/L, y1n/W), (x2n/L, y2n/W), ..., (x18n/L, y18n/W));
wherein xn is the abscissa of the nth bone keypoint, yn is the ordinate of the nth bone keypoint, and Tn is the normalized bone keypoint coordinates of the nth frame.
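The normalization above can be sketched as follows; the division of each abscissa by L and each ordinate by W follows the text, while the `normalize_keypoints` name, the (18, 2) array layout, and the sample frame dimensions are assumptions for illustration.

```python
import numpy as np

L, W = 1920.0, 1080.0  # hypothetical single-frame length and width

def normalize_keypoints(frame_kpts: np.ndarray) -> np.ndarray:
    """Normalize 18 bone keypoints: x_n / L and y_n / W, as in Tn."""
    out = frame_kpts.astype(float).copy()
    out[:, 0] /= L  # abscissa of each bone keypoint
    out[:, 1] /= W  # ordinate of each bone keypoint
    return out

# one frame of 18 keypoints, all at the frame midpoint
T_n = normalize_keypoints(np.array([[960.0, 540.0]] * 18))
```

After this step every coordinate lies in [0, 1] regardless of the original video resolution.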
step 2.1.3: calculating the change speed of the keypoints between adjacent frames, the speed Vn being:
Vn = ((x1n - x1n-1, y1n - y1n-1), (x2n - x2n-1, y2n - y2n-1), ..., (x18n - x18n-1, y18n - y18n-1));
wherein x and y have the same meaning as in step 2.1.2. After the speed V is obtained, feature splicing is performed, and the total spliced feature Dn is:
Dn = Concat(Tn, T'n, Vn)
wherein Tn and T'n represent the normalized bone keypoint coordinates obtained from the side view and the front view at time n, and the Concat function represents splicing of the variables in brackets.
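A minimal sketch of the velocity and feature-splicing steps above, assuming numpy arrays of shape (18, 2) for the normalized keypoints; the variable names and random sample data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T_prev = rng.random((18, 2))  # normalized keypoints, frame n-1
T_n    = rng.random((18, 2))  # normalized keypoints, frame n
T_side = rng.random((18, 2))  # normalized keypoints at time n, side view

# Vn: per-keypoint coordinate difference between adjacent frames
V_n = T_n - T_prev

# Dn = Concat(Tn, Tn', Vn): splice along the feature axis, 6 values per joint
D_n = np.concatenate([T_n, T_side, V_n], axis=1)
```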
Step 2.2: the bone point data is further refined to be used as high-order information, and the data in the step 2.1 form a double-flow network to complement each other;
step 2.2.1: because joint angles are critical to the action category, the human bone keypoints extracted by OpenPose are screened, keeping the left knee, right knee, left waist, right waist, left shoulder, right shoulder, left elbow and right elbow;
step 2.2.2: calculating an included angle:
(1) Knee:
(2) Waist:
(3) Shoulder:
(4) Elbow:
step 3: transmitting the spliced data into a graph neural network, the graph neural network structure mainly comprising three parts;
step 3.1: the first part adopts the default human skeleton structure identified by the OpenPose pose estimation algorithm as the basic connection of the graph neural network; this basic structure adapts to the basic motion forms of human beings and has a certain modeling capability for motion of any form; the adjacency matrix of the graph structure is set as Ak, representing the k-th layer, an N x N two-dimensional matrix where N equals 18, representing the 18 bone keypoints. The A(n1, n2) position represents the connection state between positions n1 and n2, a value of 1 representing connected and a value of 0 representing disconnected;
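The fixed adjacency matrix A of step 3.1 can be sketched as follows. The patent only states that the default OpenPose skeleton connections are used, so the concrete edge list here is an assumption based on the commonly used OpenPose 18-keypoint (COCO) layout.

```python
import numpy as np

# Assumed OpenPose COCO-18 skeleton edges (keypoint index pairs)
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7),
         (1, 8), (8, 9), (9, 10), (1, 11), (11, 12), (12, 13),
         (0, 14), (14, 16), (0, 15), (15, 17)]

A = np.zeros((18, 18))
for n1, n2 in EDGES:
    A[n1, n2] = A[n2, n1] = 1  # 1 = connected, 0 = disconnected
```

Because connection is an undirected relation between keypoints, the matrix is built symmetrically.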
step 3.2: second part to compensate the fitting ability of the basic structure to the motion diversity, we set the adjacency matrix of the structure as B k It represents the structural adjacency matrix of the k-th layer. The matrix is also an N multiplied by N two-dimensional matrix, the meaning is the same as that of A, except that the matrix has no fixed value, each element of the matrix is a trainable parameter, and the training stage automatically learns which connection modes have better compensation effect on actions;
step 3.3: the third part is a data-driven graph structure, which takes different values for each different action; we set this matrix as Ck, with format consistent with A and B, and Ck(n1, n2):
Ck(n1, n2) = Softmax(θ(fin(n1))^T · φ(fin(n2)));
this is a normalized Gaussian embedding that calculates the similarity between any two bone keypoints, θ and φ being two embedding functions, and T denoting matrix transposition, so that the final output dimension is unchanged. The Softmax operation maps the final values towards 0 and 1, indicating whether two keypoints are connected. The output of the final graph neural network is:
fout = Σ (k = 1..K) Wk · fin · (Ak + Bk + Ck);
wherein fin and fout respectively represent the input and output of the layer, K represents the total number of layers of the graph neural network, the matrices A, B and C are as described in the steps above, and W represents the convolution parameters;
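The layer output fout = Σk Wk fin (Ak + Bk + Ck) can be sketched numerically as below. The sum index k and the roles of A, B and C follow the formulas above; the feature sizes, the value of K, and all random initializations are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d_in, d_out, K = 18, 6, 16, 3   # keypoints, feature dims, number of terms

f_in = rng.random((d_in, N))        # per-keypoint input features
f_out = np.zeros((d_out, N))
for k in range(K):
    A_k = rng.integers(0, 2, (N, N)).astype(float)  # fixed skeleton links
    B_k = rng.random((N, N))                        # trainable compensation graph
    # C_k: normalized Gaussian embedding, row-wise Softmax over similarities
    theta, phi = rng.random((8, d_in)), rng.random((8, d_in))
    S = (theta @ f_in).T @ (phi @ f_in)             # N x N pairwise similarity
    C_k = np.exp(S - S.max(axis=1, keepdims=True))
    C_k /= C_k.sum(axis=1, keepdims=True)
    W_k = rng.random((d_out, d_in))                 # convolution parameters
    f_out += W_k @ f_in @ (A_k + B_k + C_k)
```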
step 4: extending the graph convolution from the spatial domain to the time domain; for a point nij we define i to represent the ith frame and j the jth bone keypoint, and each time-domain convolution involves only the same bone keypoint, giving the formula:
fout(nij) = Σt wt · fin(n(i+t)j);
wherein w is the convolution parameter, fout(nij) is the output of the layer at point nij, and the other variables are defined as before.
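The time-domain convolution over the same bone keypoint can be sketched as follows; the kernel size, number of frames, and random data are assumptions, and only one feature channel per joint is shown for brevity.

```python
import numpy as np

rng = np.random.default_rng(2)
T, J, tau = 15, 18, 3        # frames, joints, temporal kernel size
x = rng.random((T, J))       # one feature value per joint per frame
w = rng.random(tau)          # shared temporal kernel

# Each output mixes only the same bone keypoint across tau adjacent frames
y = np.zeros((T - tau + 1, J))
for j in range(J):
    y[:, j] = np.convolve(x[:, j], w[::-1], mode="valid")
```

Joints are never mixed here; spatial mixing is handled entirely by the graph convolution of step 3.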
Step 5: the performance of the network is enhanced by using a cross-attention model, which is constructed as shown in fig. 2:
step 5.1: cross attention enhances the expressive power of the main network flow through the feature map of the skeletal joint angle network branches, the formula of which is:
fattention = (1 + Attention) * fout
the self-attention model is a residual attention model because as the number of network layers deepens, simple attention stacking will cause some features to disappear.
Step 5.2: the method for calculating the Attention comprises the following steps:
Attention = g(fself, fcross) * fout
wherein fself is the self-attention weight of the main-network output feature map and fcross is the cross-attention weight between the joint angles and the main-network data; the two are added to weight the main-network feature map, and g represents transforming both to the dimension of fout and summing them. fcross is computed as follows:
wherein v(T, N, d) is the main-network feature map, N is the number of main-network skeleton nodes, and d represents the feature dimension of each node; a(T, k, m) is the joint-angle network feature map, k represents the number of joints in the bone joint angle data, and m is its dimension. The formula computes the associations between the nodes of the two networks and uses them as cross attention.
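The cross-attention enhancement fattention = (1 + Attention) * fout can be sketched as below. The text does not spell out g or the node-association formula, so the projection `P` and the mean-pooling used for g are assumptions; shapes follow the v(T, N, d) and a(T, k, m) definitions above.

```python
import numpy as np

rng = np.random.default_rng(3)
T, N, d = 15, 18, 16     # frames, skeleton nodes, node feature dim
k, m = 8, 4              # joint-angle branch: joints and feature dim

v = rng.random((T, N, d))       # main-network feature map
a = rng.random((T, k, m))       # joint-angle branch feature map
f_out = rng.random((T, N, d))   # main-network output

P = rng.random((m, d))          # assumed projection into node-feature space
f_cross = np.einsum("tnd,tkd->tnk", v, a @ P)  # node/joint associations
f_self = np.einsum("tnd,tmd->tnm", v, v)       # node/node self-attention
# g: pool both to the shape of f_out (assumption) and sum
g = f_self.mean(axis=2, keepdims=True) + f_cross.mean(axis=2, keepdims=True)
attention = g * f_out
f_attention = (1 + attention) * f_out          # residual attention
```

The (1 + Attention) form keeps the original fout intact even when the attention term is small, which is what prevents features from vanishing as layers stack.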
Step 6: the convolution details of the spatial domain and the time domain are described in detail by the steps 3 and 4, and a space-time diagram convolution module is a complete system. The input is performed first to keep a residual error connected to the module, and finally, the first operation is to perform space domain graph convolution, then batch normalization operation BatchNormalization, reLU to activate layers, dropout layers with coefficients of 0.5, then perform space domain graph convolution, and then connect batch normalization operations Batchnormalization and ReLU activating layers. And the two branches of the network are respectively composed of nine space-time convolution modules, a global pooling layer and a Softmax layer. Finally, the action category is obtained through a Softmax classifier.
Of course, before the graph convolutional neural network of this embodiment is used to identify human actions, the model is first trained. The training part uses the PyTorch framework with a cross-entropy loss function, whose formula is:
Loss = -[y·log y' + (1 - y)·log(1 - y')]
where y is the label of the sample and y' is our model's prediction. During training we set the batch size to 64, optimize using SGD (stochastic gradient descent) with momentum 0.9 and weight decay 0.0001, and train for a total of 30 epochs.
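The cross-entropy formula above can be checked numerically; the sample label and prediction values are illustrative.

```python
import numpy as np

def bce(y: float, y_pred: float) -> float:
    """Loss = -[y*log(y') + (1-y)*log(1-y')] for a single sample."""
    return float(-(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred)))

loss_good = bce(1.0, 0.9)  # confident, correct prediction -> small loss
loss_bad = bce(1.0, 0.1)   # confident, wrong prediction  -> large loss
```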
The technical scheme of the invention is not limited to the specific embodiment, and all technical modifications made according to the technical scheme of the invention fall within the protection scope of the invention.
Claims (4)
1. The human action recognition method based on the graph convolution neural network is characterized by comprising the following steps of:
step 1: preparing human action video data, marking the video data, and marking the video labels according to different kinds of actions;
step 2: extracting bone keypoint features from the human action video data by using the OpenPose pose estimation algorithm, calculating the change speed of the bone keypoints between adjacent frames through the skeleton-point main-stream network, and performing feature splicing; screening the bone keypoints, calculating the included angles of the screened bone keypoints through the angle branch network, and performing feature splicing;
step 2.1: firstly, cutting videos to ensure that human beings in each video are positioned in the center of the video;
step 2.2: using the OpenPose pose estimation algorithm to extract human bone keypoints, take 15 equally spaced sampling points S = (T1, T2, T3, ..., T15) from the video and save the bone keypoint data at each point; 18 bone keypoints are extracted each time, respectively representing 18 parts of the human body; setting the length of a single video frame as L and the video width as W, normalize the extracted bone keypoint coordinates, using Tn to represent the bone keypoint data of the nth frame, where the normalized Tn is:
Tn = ((x1n/L, y1n/W), (x2n/L, y2n/W), ..., (x18n/L, y18n/W));
wherein xn is the abscissa of the nth bone keypoint, yn is the ordinate of the nth bone keypoint, and Tn is the normalized bone keypoint coordinates of the nth frame;
step 2.3: calculating the change speed of the keypoints between adjacent frames, the speed Vn being:
Vn = ((x1n - x1n-1, y1n - y1n-1), (x2n - x2n-1, y2n - y2n-1), ..., (x18n - x18n-1, y18n - y18n-1));
wherein the specific meaning of x and y is the same as in step 2.2; after the speed V is obtained, feature splicing is performed, and the total spliced feature Dn is:
Dn = Concat(Tn, T'n, Vn);
wherein Tn and T'n respectively represent the normalized bone keypoint coordinates obtained from the side view and the front view at time n, and the Concat function represents splicing of the variables in brackets;
step 2.4: screening the bone keypoints extracted by OpenPose, keeping the left knee, right knee, left waist, right waist, left shoulder, right shoulder, left elbow and right elbow;
step 2.5: calculating an included angle:
(1) Knee:
(2) Waist:
(3) Shoulder:
(4) Elbow:
step 3: transmitting the spliced data to a graph neural network;
step 4: extending the graph convolution from the spatial domain to the temporal domain;
step 5: using a cross-attention model to enhance the performance of the network;
step 5.1: cross attention enhances the expressive power of the main network flow through the feature map of the skeletal joint angle network branches, the formula of which is:
fattention = (1 + Attention) * fout;
step 5.2: the method for calculating the Attention comprises the following steps:
Attention = g(fself, fcross) * fout
wherein fself is the self-attention weight of the main-network output feature map and fcross is the cross-attention weight between the joint angles and the main-network data, the two being added to weight the main-network feature map; g represents transforming both to the dimension of fout and summing them; fcross is computed as follows:
wherein v(T, N, d) is the main-network feature map, N is the number of main-network skeleton nodes, and d represents the feature dimension of each node; a(T, k, m) is the joint-angle network feature map, k represents the number of joints in the bone joint angle data, and m is its dimension;
step 6: and constructing a graph convolution neural network consisting of nine space-time convolution modules, a global pooling layer and a Softmax layer, wherein the global pooling layer is used for summarizing node characteristics in a graph structure so as to upgrade the node-level characteristics into the graph-level characteristics, and then outputting the action numbers of people in a human action video through the Softmax layer.
2. The human action recognition method based on the graph convolution neural network according to claim 1, wherein the step 3 specifically comprises:
step 3.1: the default human skeleton structure identified by the OpenPose pose-estimation algorithm is used as the basic connectivity of the graph neural network; the adjacency matrix of the graph structure is denoted A_k, the adjacency matrix of the k-th layer, an N × N two-dimensional matrix where N equals 18, representing the 18 skeleton key points; the entry A(n1, n2) represents the connection state between key points n1 and n2, where a value of 1 denotes connected and a value of 0 denotes not connected;
step 3.2: a second adjacency matrix of the graph structure is denoted B_k, the action-structure adjacency matrix of the k-th layer; it is likewise an N × N two-dimensional matrix with the same meaning as A, except that it has no fixed values: every element is a trainable parameter;
step 3.3: a third adjacency matrix of the graph structure is denoted C_k, with the same format as A and B; C_k(n1, n2) is obtained by a normalized Gaussian embedding that computes the similarity between any two bone key points, C_k(n1, n2) = softmax(θ(f_in)^T φ(f_in)), where θ and φ are the two embedding functions and T denotes matrix transposition, so the final output dimension is unchanged; the Softmax operation drives the result towards 0 or 1, indicating whether the two key points are connected; the output formula of the final graph neural network is:
f_out = Σ_{k=1}^{K} W_k f_in (A_k + B_k + C_k)
wherein f_in and f_out denote the input and output of this network layer respectively, K denotes the total number of layers of the graph neural network, and W denotes the convolution parameter.
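A single layer of claim 2 can be sketched as below. The patent's embedding functions are shown here as plain linear maps `theta`/`phi` (an assumption; in practice they would be 1×1 convolutions), and only one k-term of the summed output formula is shown:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_graph_layer(f_in, A, B, theta, phi, W):
    """One k-term of f_out = sum_k W_k f_in (A_k + B_k + C_k), as a sketch.

    f_in:       (T, N, d) node features over T frames and N = 18 key points
    A:          fixed skeleton adjacency (N, N), entries 0/1 per claim 2
    B:          trainable action-structure adjacency (N, N)
    theta, phi: (d, de) embedding matrices standing in for the two embeddings
    W:          (d, d_out) convolution parameter
    """
    # C: normalized Gaussian embedding -- softmax of embedded pairwise similarity
    e1, e2 = f_in @ theta, f_in @ phi                  # (T, N, de) each
    C = softmax(e1 @ e2.transpose(0, 2, 1), axis=-1)   # (T, N, N), rows sum to 1
    # aggregate each node's neighbours under (A + B + C), then project with W
    return ((A + B + C) @ f_in) @ W                    # (T, N, d_out)
```

Broadcasting lets the fixed (N, N) matrices A and B combine with the per-frame (T, N, N) matrix C without copying.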
3. The method for recognizing human actions based on the graph convolutional neural network according to claim 2, wherein step 4 specifically comprises:
for a point n_ij, define i as the i-th frame and j as the j-th bone key point; each time-domain convolution involves only the same bone key point, according to the formula:
wherein w is the convolution parameter and the other term denotes the output of the n-th layer.
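The constraint in claim 3, with the formula itself given as an image in the original, amounts to a 1-D convolution along frames applied independently per key point; a minimal sketch (function name mine) might be:

```python
import numpy as np

def temporal_conv(f, w):
    """Time-domain convolution per claim 3: for a point n_ij (frame i, bone
    key point j), the kernel w slides over frames of the SAME key point j
    only -- there is no mixing across different key points.

    f: (T, N) one feature value per frame per key point; w: (kt,) kernel.
    Returns (T - kt + 1, N) valid-mode outputs.
    """
    T, N = f.shape
    kt = len(w)
    out = np.zeros((T - kt + 1, N))
    for j in range(N):                       # each key point convolved independently
        for i in range(T - kt + 1):
            out[i, j] = np.dot(f[i:i + kt, j], w)
    return out
```

In a real network this would be a strided 2-D convolution with kernel (kt, 1) over the (frame, key point) axes; the loop form is only to make the per-key-point constraint explicit.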
4. A method for identifying human actions based on a graph convolutional neural network according to claim 3, wherein said step 6 specifically comprises:
step 6.1: the input first enters the module with a residual connection preserved; the first operation is a spatial-domain graph convolution, followed by a batch normalization operation (BatchNormalization), a ReLU activation layer and a dropout layer with coefficient 0.5; a temporal convolution is then carried out, again followed by a batch normalization operation and a ReLU activation layer; the overall network structure consists of nine spatio-temporal convolution modules, a global pooling layer and a Softmax layer;
step 6.2: the global pooling layer in the network summarizes the node features in the graph structure, promoting node-level features to graph-level features; the Softmax layer then outputs the action number of the person in the video.
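The pooling-and-classification head of step 6.2 can be sketched as follows; the classifier weight `W_cls` and the use of global *average* pooling are assumptions, since the patent specifies only "global pooling" followed by Softmax:

```python
import numpy as np

def pool_and_classify(features, W_cls):
    """Sketch of step 6.2: global pooling lifts node-level features (T, N, d)
    to one graph-level vector of length d, then a Softmax layer over assumed
    classifier weights W_cls (d, num_actions) yields action probabilities.
    """
    g = features.mean(axis=(0, 1))            # global pooling over frames and nodes
    logits = g @ W_cls                        # one score per action class
    e = np.exp(logits - logits.max())         # numerically stable Softmax
    probs = e / e.sum()
    return int(probs.argmax()), probs         # predicted action number, distribution
```

Averaging over both the frame and node axes is what makes the output independent of video length and key-point count, so one classifier serves all inputs.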
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011600579.7A CN112633209B (en) | 2020-12-29 | 2020-12-29 | Human action recognition method based on graph convolution neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112633209A CN112633209A (en) | 2021-04-09 |
CN112633209B true CN112633209B (en) | 2024-04-09 |
Family
ID=75286366
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011600579.7A Active CN112633209B (en) | 2020-12-29 | 2020-12-29 | Human action recognition method based on graph convolution neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112633209B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113255514B (en) * | 2021-05-24 | 2023-04-07 | 西安理工大学 | Behavior identification method based on local scene perception graph convolutional network |
CN113378656B (en) * | 2021-05-24 | 2023-07-25 | 南京信息工程大学 | Action recognition method and device based on self-adaptive graph convolution neural network |
CN113361352A (en) * | 2021-05-27 | 2021-09-07 | 天津大学 | Student classroom behavior analysis monitoring method and system based on behavior recognition |
CN113392743B (en) * | 2021-06-04 | 2023-04-07 | 北京格灵深瞳信息技术股份有限公司 | Abnormal action detection method, abnormal action detection device, electronic equipment and computer storage medium |
CN114613011A (en) * | 2022-03-17 | 2022-06-10 | 东华大学 | Human body 3D (three-dimensional) bone behavior identification method based on graph attention convolutional neural network |
CN114998990B (en) * | 2022-05-26 | 2023-07-25 | 深圳市科荣软件股份有限公司 | Method and device for identifying safety behaviors of personnel on construction site |
CN115050101B (en) * | 2022-07-18 | 2024-03-22 | 四川大学 | Gait recognition method based on fusion of skeleton and contour features |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110532960A (en) * | 2019-08-30 | 2019-12-03 | 西安交通大学 | A kind of action identification method of the target auxiliary based on figure neural network |
CN110705463A (en) * | 2019-09-29 | 2020-01-17 | 山东大学 | Video human behavior recognition method and system based on multi-mode double-flow 3D network |
CN111709321A (en) * | 2020-05-28 | 2020-09-25 | 西安交通大学 | Human behavior recognition method based on graph convolution neural network |
Non-Patent Citations (4)
Title |
---|
Action Recognition Based on Spatial Temporal Graph Convolutional Networks;Wanqiang Zheng等;《Proceedings of the 3rd International Conference on Computer Science and Application EngineeringOctober 2019》;1-5 * |
Skeleton-Based Action Recognition With Directed Graph Neural Networks;Lei Shi等;《Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)》;7912-7921 * |
Research on a Machine-Vision-Based Motion Posture Analysis ***;陈永康;《China Masters' Theses Full-text Database, Social Sciences II》(No. 2);H134-354 * |
A Survey of Deep-Learning-Based Behavior Recognition Algorithms;赫磊;邵展鹏;张剑华;周小龙;Computer Science (No. S1);149-157 * |
Also Published As
Publication number | Publication date |
---|---|
CN112633209A (en) | 2021-04-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112633209B (en) | Human action recognition method based on graph convolution neural network | |
CN109685115B (en) | Fine-grained conceptual model with bilinear feature fusion and learning method | |
CN108491880B (en) | Object classification and pose estimation method based on neural network | |
KR102450374B1 (en) | Method and device to train and recognize data | |
Liu et al. | Multi-objective convolutional learning for face labeling | |
CN111291809B (en) | Processing device, method and storage medium | |
Chen et al. | A UAV-based forest fire detection algorithm using convolutional neural network | |
CN110222718B (en) | Image processing method and device | |
CN105631398A (en) | Method and apparatus for recognizing object, and method and apparatus for training recognizer | |
CN108961253A (en) | A kind of image partition method and device | |
CN111582095B (en) | Light-weight rapid detection method for abnormal behaviors of pedestrians | |
CN109919085B (en) | Human-human interaction behavior identification method based on light-weight convolutional neural network | |
WO2021073311A1 (en) | Image recognition method and apparatus, computer-readable storage medium and chip | |
Xia et al. | Face occlusion detection based on multi-task convolution neural network | |
CN108154156B (en) | Image set classification method and device based on neural topic model | |
CN112801015A (en) | Multi-mode face recognition method based on attention mechanism | |
CN110765960B (en) | Pedestrian re-identification method for adaptive multi-task deep learning | |
CN110633624A (en) | Machine vision human body abnormal behavior identification method based on multi-feature fusion | |
CN111400572A (en) | Content safety monitoring system and method for realizing image feature recognition based on convolutional neural network | |
CN110598746A (en) | Adaptive scene classification method based on ODE solver | |
CN114463837A (en) | Human behavior recognition method and system based on self-adaptive space-time convolution network | |
Liu | Human face expression recognition based on deep learning-deep convolutional neural network | |
US20080232682A1 (en) | System and method for identifying patterns | |
CN114170659A (en) | Facial emotion recognition method based on attention mechanism | |
CN117611838A (en) | Multi-label image classification method based on self-adaptive hypergraph convolutional network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||