CN112633209B - Human action recognition method based on graph convolution neural network

Human action recognition method based on graph convolution neural network

Info

Publication number
CN112633209B
CN112633209B (application CN202011600579.7A)
Authority
CN
China
Prior art keywords
graph
network
neural network
video
human
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011600579.7A
Other languages
Chinese (zh)
Other versions
CN112633209A (en)
Inventor
毛克明
李翰鹏
Original Assignee
东北大学 (Northeastern University)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 东北大学 (Northeastern University)
Priority to CN202011600579.7A
Publication of CN112633209A
Application granted
Publication of CN112633209B
Legal status: Active


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/23 - Recognition of whole body movements, e.g. for sport training
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 - Salient features, e.g. scale invariant feature transforms [SIFT]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a human action recognition method based on a graph convolution neural network, which comprises the steps of: preparing human action video data and labelling each video according to its action category; extracting skeleton key point features from the human action video data with the OpenPose pose estimation algorithm, computing the change speed of the skeleton key points between adjacent frames in the skeleton-point main-stream network, and concatenating the features; screening the skeleton key points, computing the included angles of the screened key points in the angle branch network, and concatenating the features; feeding the concatenated data into a graph neural network; extending the graph convolution from the spatial domain to the temporal domain; enhancing the performance of the network with a cross-attention model; and recognizing the human action. The invention can recognize and output the actions performed by the people in an input video, has good usability and robustness, and lays a foundation for the practical deployment of artificial intelligence technology in the field of action recognition.

Description

Human action recognition method based on graph convolution neural network
Technical Field
The invention relates to the technical field of computer vision, in particular to a human action recognition method based on a graph convolution neural network.
Background
Artificial intelligence technology has spread into many industries, and action recognition is a key technology behind a number of popular applications and demands; it has become one of the most actively studied directions in the field of computer vision. Examples include the detection of, and alarms for, abnormal human behaviour in intelligent surveillance cameras, and the classification and retrieval of human behaviour in videos; motion capture technology in high-quality games likewise allows the movements of professional actors to be transferred into the game, giving players a sense of immersion. It can be expected that there will be more and more applications of action recognition technology in the future.
Current computer vision work on human action recognition is mainly divided into two families of methods: one is based on the RGB frames and optical flow of the video, and the other is based on human skeleton key points. The RGB-and-optical-flow methods allow end-to-end learning of the task, but extracting optical flow from video is very heavy work; although various methods exist to reduce this cost, optical flow has always been an indispensable and powerful feature for the action recognition task. The skeleton-key-point methods are a newer family of action recognition approaches that became possible once pose estimation technology matured; compared with the traditional RGB-and-optical-flow methods they can model human behaviour more effectively, because the traditional methods cannot avoid the influence of background and lighting changes. On the other hand, they require an extra step that the traditional methods do not: extracting features from the video with a pose estimation algorithm. In addition, existing skeleton-based recognition methods simply use the raw skeleton key point data, whereas the information describing a motion includes not only the coordinates but also the joint angles and their change speeds, which are important elements of the feature description for action recognition.
Thus, given the current state of the art and the complexity of the actions themselves, a human action recognition method is needed for this task that has a deep learning theoretical basis and uses more descriptive elements.
Disclosure of Invention
The invention aims to provide a human action recognition method based on a graph convolution neural network, in view of the current state of the field and the complexity of the actions themselves.
In order to achieve the above purpose, the invention is implemented according to the following technical scheme:
a human action recognition method based on a graph convolution neural network comprises the following steps:
step 1: preparing human action video data and labelling it, marking each video with a label according to its action category;
step 2: extracting skeleton key point features from the human action video data with the OpenPose pose estimation algorithm, calculating the change speed of the skeleton key points between adjacent frames in the skeleton-point main-stream network, and performing feature concatenation; screening the skeleton key points, calculating the included angles of the screened skeleton key points in the angle branch network, and performing feature concatenation;
step 3: transmitting the concatenated data to a graph neural network;
step 4: extending the graph convolution from the spatial domain to the temporal domain;
step 5: using a cross-attention model to enhance the performance of the network;
step 6: constructing a graph convolution neural network consisting of nine space-time convolution modules, a global pooling layer and a Softmax layer, wherein the global pooling layer summarizes the node features in the graph structure so as to lift the node-level features to graph-level features, and the Softmax layer then outputs the action number of the person in the human action video.
Further, the step 2 specifically includes:
step 2.1: firstly, cropping the videos so that the person in each video is located at the center of the frame;
step 2.2: using the OpenPose pose estimation algorithm to extract the human skeleton key points; for the video S, 15 equally spaced sampling points S = (T1, T2, T3, ..., T15) are taken and the skeleton key point data of each point are saved; 18 skeleton key points are extracted each time, representing 18 parts of the human body respectively; letting the length of a single video frame be L and its width be W, the extracted skeleton key point coordinates are normalized, with Tn denoting the skeleton key point data of the nth sampled frame, the normalized Tn being:
wherein x_n is the abscissa of the nth skeleton key point, y_n is the ordinate of the nth skeleton key point, and Tn is the normalized skeleton key point coordinates of the nth frame;
step 2.3: calculating the change speed of the key points between adjacent frames, the speed Vn being:
V_n = ((x_{1,n} - x_{1,n-1}, y_{1,n} - y_{1,n-1}), (x_{2,n} - x_{2,n-1}, y_{2,n} - y_{2,n-1}), ..., (x_{18,n} - x_{18,n-1}, y_{18,n} - y_{18,n-1}));
wherein the specific meanings of x and y are the same as in step 2.2; after the speed V is obtained, feature concatenation is performed, and the total concatenated feature Dn is:
D_n = Concat(T_n, T_n', V_n);
wherein T_n and T_n' respectively denote the normalized skeleton key point coordinates obtained from the side view and the front view at time n, and the Concat function denotes concatenation of the variables in brackets;
step 2.4: screening the skeleton key points extracted by OpenPose, keeping the left knee, right knee, left waist, right waist, left shoulder, right shoulder, left elbow and right elbow;
step 2.5: calculating an included angle:
(5) Knee:
(6) Waist:
(7) Shoulder:
(8) Elbow of hand):
further, the step 3 specifically includes:
step 3.1: the default human skeleton structure identified by the OpenPose pose estimation algorithm is used as the basic connectivity of the graph neural network, and the adjacency matrix of the graph neural network structure is set as A_k, the adjacency matrix of the k-th layer of the network; it is an N x N two-dimensional matrix, where N equals 18 and represents the 18 skeleton key points; the position A(n1, n2) represents the connection state of positions n1 and n2, a value of 1 meaning connected and a value of 0 meaning not connected;
step 3.2: the adjacency matrix of the graph neural network structure is set as B_k, which represents the action structure adjacency matrix of the k-th layer; this matrix is also an N x N two-dimensional matrix with the same meaning as A, except that it has no fixed values and every element of it is a trainable parameter;
step 3.3: the adjacency matrix of the graph neural network structure is set as C_k, whose format is consistent with A and B, with C_k(n1, n2):
this process is a normalized Gaussian embedding that calculates the similarity between any two skeleton key points, θ and φ being the two embedding functions respectively, where T denotes matrix transposition, so that the final output dimension is unchanged; the Softmax operation maps the final values into the range 0 to 1, indicating whether two key points are connected; the output formula of the final graph neural network is:
wherein f_in and f_out respectively denote the input and output of this layer of the network, K denotes the total number of layers of the graph neural network, and W denotes the convolution parameters.
Further, the step 4 specifically includes:
for a point n_ij, define i to represent the ith frame and j to represent the jth skeleton key point; each temporal convolution involves only the same skeleton key point across frames, giving the formula:
wherein W is a convolution parameter and the other symbol denotes the output of the nth layer.
Further, the step 5 specifically includes:
step 5.1: cross attention enhances the expressive power of the main network flow through the feature map of the skeletal joint angle network branches, the formula of which is:
f attention =(1+Attention)*f out
step 5.2: the method for calculating the Attention comprises the following steps:
Attention=g(f self ,f cross )*f out
wherein f self Is the self-attention weight of the main network output characteristic diagram, f cross Is the angle of the joint and the mainCross-attention weights of network data, added to add weight to the primary network feature map, where g represents a transformation of both dimensions to f out Is added up, wherein f cross The method comprises the following steps:
wherein v (T, N, d) is a main network feature graph, N is the number of main network data skeleton nodes, and d represents the feature dimension of each node; a (T, k, m) is a joint angle network feature map, k represents the number of joints of the skeletal joint angle data, and m is its dimension.
Further, the step 6 specifically includes:
step 6.1: the input is firstly carried out to keep a residual error and is connected to the module, the first operation is to carry out space domain graph convolution, then to carry out batch normalization operation BatchNormalization, reLU to activate a layer, a dropout layer with a coefficient of 0.5, then to carry out space graph convolution, then to connect with batch normalization operation Batchnormalization and ReLU activating layers, and the network overall structure is composed of nine space-time convolution modules, a global pooling layer and a Softmax layer.
Step 6.2: the global pooling layer in the network is used for summarizing node characteristics in the graph structure, so that the node-level characteristics are updated to the graph-level characteristics, and then the action numbers of people in the video are output through the Softmax layer.
Compared with the prior art, the human action recognition method based on the graph convolution neural network of the invention can recognize and output the actions performed by the people in an input video, has good usability and robustness, and lays a foundation for the practical deployment of artificial intelligence technology in the field of action recognition.
Drawings
Fig. 1 is a flowchart of a human motion recognition method based on a graph convolutional neural network according to the present invention.
Fig. 2 is a cross-attention network architecture.
Detailed Description
The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. The specific embodiments described herein are for purposes of illustration only and are not intended to limit the invention.
As shown in fig. 1 and 2, the present embodiment provides a human action recognition method based on a graph convolution neural network, which includes the following steps:
step 1: preparing human action video data and marking each video with a label according to its action category, with the labels numbered starting from 0;
step 2.1: performing feature extraction and feature design on the basic data to serve as motion information features;
step 2.1.1: firstly, cropping the videos so that the person in each video is located at the center of the frame;
step 2.1.2: using the OpenPose pose estimation algorithm to extract the human skeleton key points, we take 15 equally spaced sampling points S = (T1, T2, T3, ..., T15) from the video S and save the skeleton key point data of each point. 18 skeleton key points are extracted each time, representing 18 parts of the human body respectively. Letting the length of a single video frame be L and its width be W, the extracted skeleton key point coordinates are normalized; Tn denotes the skeleton key point data of the nth sampled frame, and the normalized Tn is:
wherein x_n is the abscissa of the nth skeleton key point, y_n is the ordinate of the nth skeleton key point, and Tn is the normalized skeleton key point coordinates of the nth frame.
Step 2.1.3: calculating the change speed of the key points between adjacent frames, the speed Vn being:
V_n = ((x_{1,n} - x_{1,n-1}, y_{1,n} - y_{1,n-1}), (x_{2,n} - x_{2,n-1}, y_{2,n} - y_{2,n-1}), ..., (x_{18,n} - x_{18,n-1}, y_{18,n} - y_{18,n-1}))
wherein x and y have the same meaning as in step 2.1.2. After obtaining the speed V, feature concatenation is performed, and the total concatenated feature Dn is:
D_n = Concat(T_n, T_n', V_n)
wherein T_n and T_n' denote the normalized skeleton key point coordinates obtained from the side view and the front view at time n, and the Concat function denotes concatenation of the variables in brackets.
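As an illustration of steps 2.1.2 and 2.1.3, the following is a minimal NumPy sketch of the keypoint normalization, velocity and concatenation described above. The normalization formula itself appears only as an image in the source text, so dividing x by the frame width W and y by the frame length L, and computing the velocity from the front-view coordinates, are assumptions of this sketch; the function names are illustrative.
```python
import numpy as np

def normalize_keypoints(frames_xy, frame_w, frame_l):
    """Normalize per-frame OpenPose keypoints.

    frames_xy: array of shape (T, 18, 2) with pixel coordinates (x, y);
    frame_w, frame_l: width W and length L of a single frame (division by
    W and L is an assumption, since the formula is shown only as an image)."""
    out = frames_xy.astype(np.float32)
    out[..., 0] /= frame_w        # x_n / W
    out[..., 1] /= frame_l        # y_n / L
    return out

def keypoint_velocity(norm_xy):
    """V_n: per-joint coordinate difference between adjacent sampled frames."""
    vel = np.zeros_like(norm_xy)
    vel[1:] = norm_xy[1:] - norm_xy[:-1]
    return vel

def build_main_stream_feature(front_xy, side_xy, frame_w, frame_l):
    """D_n = Concat(T_n, T_n', V_n): front-view coordinates, side-view
    coordinates and velocities, concatenated per joint -> (T, 18, 6)."""
    t_front = normalize_keypoints(front_xy, frame_w, frame_l)
    t_side = normalize_keypoints(side_xy, frame_w, frame_l)
    v = keypoint_velocity(t_front)    # velocity from the front view (assumed)
    return np.concatenate([t_front, t_side, v], axis=-1)
```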
Step 2.2: the bone point data is further refined to be used as high-order information, and the data in the step 2.1 form a double-flow network to complement each other;
step 2.2.1: because the joint angle is critical to the action category, the key points of the human bones extracted by openpost are screened, and the left knee, the right knee, the left waist, the right waist, the left shoulder, the right shoulder, the left elbow and the right elbow are saved;
step 2.2.2: calculating an included angle:
(1) Knee:
(2) Waist:
(3) Shoulder:
(4) Elbow of hand):
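The angle formulas for the knee, waist, shoulder and elbow are likewise shown only as images in the source text. A common way to realize step 2.2.2 is to take, at each screened joint, the included angle between the two adjacent limb segments; the sketch below does this with the arccosine of the dot product. The OpenPose COCO-18 joint indices and the specific neighbouring joints chosen for each angle are assumptions, not taken from the patent text.
```python
import numpy as np

# OpenPose COCO-18 joint indices, assumed for illustration; the patent does
# not list the indices or the neighbouring joints it uses for each angle.
JOINT_TRIPLES = {
    "right_knee": (8, 9, 10),     # right hip -> right knee -> right ankle
    "left_knee": (11, 12, 13),
    "right_waist": (2, 8, 9),     # right shoulder -> right hip -> right knee
    "left_waist": (5, 11, 12),
    "right_shoulder": (1, 2, 3),  # neck -> right shoulder -> right elbow
    "left_shoulder": (1, 5, 6),
    "right_elbow": (2, 3, 4),     # right shoulder -> right elbow -> right wrist
    "left_elbow": (5, 6, 7),
}

def included_angle(a, b, c, eps=1e-6):
    """Angle at joint b (radians) between the segments b->a and b->c."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + eps)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def joint_angle_features(norm_xy):
    """norm_xy: (T, 18, 2) normalized keypoints -> (T, 8) angle features."""
    t_frames = norm_xy.shape[0]
    feats = np.zeros((t_frames, len(JOINT_TRIPLES)), dtype=np.float32)
    for j, (ia, ib, ic) in enumerate(JOINT_TRIPLES.values()):
        for t in range(t_frames):
            feats[t, j] = included_angle(norm_xy[t, ia], norm_xy[t, ib], norm_xy[t, ic])
    return feats
```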
step 3: transmitting the concatenated data into the graph neural network, the graph neural network structure mainly comprising three parts;
step 3.1: the first part adopts a default human skeleton structure identified by an opensense attitude estimation algorithm as basic connection of a graph neural network, the function of the basic structure of the first part is to adapt to basic motion forms of human beings, and the basic structure has certain modeling capability on any form of motion, and an adjacency matrix of the graph structure is set as A k Representing the k-th network, which is an N x N two-dimensional matrix, where N is equal to 18, representing 18 skeletal keypoints. The A (n 1, n 2) position represents the connection state of the n1 and n2 positions, a value of 1 represents connection and a value of 0 represents disconnection;
step 3.2: second part to compensate the fitting ability of the basic structure to the motion diversity, we set the adjacency matrix of the structure as B k It represents the structural adjacency matrix of the k-th layer. The matrix is also an N multiplied by N two-dimensional matrix, the meaning is the same as that of A, except that the matrix has no fixed value, each element of the matrix is a trainable parameter, and the training stage automatically learns which connection modes have better compensation effect on actions;
step 3.3: the third part is a data-driven graph structure, which has different values for each different action, we set the matrix to C k The format is consistent with A and B, C k (n1,n2):
The process is a normalized Gaussian embedding method to calculate the similarity between any two bone key points, θ andtwo embedding methods are respectively adopted, T represents matrix transposition, and the final output dimension is unchanged. The Softmax method changes the final value to 0 and 1, indicating whether or not connected. The output of the final graph neural network is shown as:
wherein f in And f out Respectively representing the input and the output of the layer network, K represents the total layer number of the graph neural network, A matrix B matrix C matrix is described in the steps, and W represents convolution parameters;
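A possible PyTorch realization of the three-part adjacency described in steps 3.1 to 3.3 is sketched below: A_k is the fixed skeleton adjacency, B_k is a trainable matrix of the same shape, and C_k is computed from two embeddings followed by Softmax. The embedding width, the time-averaging used before computing C_k and the exact form of the output sum are assumptions, since the output formula appears only as an image in the text.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveGraphConv(nn.Module):
    """Spatial graph convolution over the N = 18 joints using A_k + B_k + C_k.

    A: fixed skeleton adjacency (K, N, N); B: trainable adjacency of the same
    shape; C: data-dependent adjacency built from two embeddings (theta, phi)
    followed by Softmax, as described in step 3.3.
    """
    def __init__(self, in_ch, out_ch, A, embed_ch=16):    # embed_ch is assumed
        super().__init__()
        self.register_buffer("A", A)                       # (K, N, N), fixed skeleton
        self.B = nn.Parameter(torch.zeros_like(A))         # trainable adjacency
        K = A.size(0)
        self.theta = nn.ModuleList(nn.Conv2d(in_ch, embed_ch, 1) for _ in range(K))
        self.phi = nn.ModuleList(nn.Conv2d(in_ch, embed_ch, 1) for _ in range(K))
        self.W = nn.ModuleList(nn.Conv2d(in_ch, out_ch, 1) for _ in range(K))

    def forward(self, x):                                  # x: (B, C, T, N)
        out = 0
        for k in range(self.A.size(0)):
            # C_k(n1, n2): embedded-Gaussian similarity between any two joints,
            # averaged over time for simplicity (an assumption of this sketch)
            q = self.theta[k](x).mean(dim=2)               # (B, e, N)
            v = self.phi[k](x).mean(dim=2)                 # (B, e, N)
            Ck = F.softmax(torch.einsum("ben,bem->bnm", q, v), dim=-1)
            adj = self.A[k] + self.B[k] + Ck               # broadcast to (B, N, N)
            y = torch.einsum("bctn,bnm->bctm", x, adj)     # propagate over the graph
            out = out + self.W[k](y)                       # W_k: 1x1 convolution
        return out
```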
step 4: extending the graph convolution from the spatial domain to the temporal domain. For a point n_ij we define i to represent the ith frame and j to represent the jth skeleton key point; each temporal convolution involves only the same skeleton key point across frames, giving the formula:
wherein W is a convolution parameter, the other symbol denotes the output of the nth layer, and the remaining variables are defined as before.
Step 5: the performance of the network is enhanced by using a cross-attention model, which is constructed as shown in fig. 2:
step 5.1: cross attention enhances the expressive power of the main network flow through the feature map of the skeletal joint angle network branches, the formula of which is:
f attention =(1+Attention)*f out
the self-attention model is a residual attention model because as the number of network layers deepens, simple attention stacking will cause some features to disappear.
Step 5.2: the method for calculating the Attention comprises the following steps:
Attention=g(f self ,f cross )*f out
wherein f self Is the self-attention weight of the main network output characteristic diagram, f cross Is the joint angle and the cross attention weight of the main network data, and the joint angle and the cross attention weight are added to form the main networkThe feature map increases the weight. Where g represents transforming both dimensions to f out And added together. Wherein f cross The method comprises the following steps:
wherein v (T, N, d) is a main network feature graph, N is the number of main network data skeleton nodes, and d represents the feature dimension of each node; a (T, k, m) is a joint angle network feature map, k represents the number of joints of the skeletal joint angle data, and m is its dimension. The formula calculates the association between the different nodes of the two networks respectively and uses the association as cross attention.
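The exact definitions of g and f_cross are shown only as images, so the following sketch is one plausible reading of steps 5.1 and 5.2: both feature maps are projected to a common dimension (an assumed size), node-to-joint similarities give f_cross, node-to-node similarities give f_self, and g is a learned gate over their concatenation; the result re-weights f_out in the residual form f_attention = (1 + Attention) * f_out.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    """Residual cross attention between the main (skeleton) stream and the
    joint-angle branch: Attention = g(f_self, f_cross) * f_out and
    f_attention = (1 + Attention) * f_out."""
    def __init__(self, d_main, d_angle, d_common=64):      # d_common is assumed
        super().__init__()
        self.proj_main = nn.Linear(d_main, d_common)
        self.proj_angle = nn.Linear(d_angle, d_common)
        self.g = nn.Linear(2 * d_common, 1)                # assumed form of g

    def forward(self, f_out, f_angle):
        # f_out:   (B, T, N, d_main) main-stream feature map v(T, N, d)
        # f_angle: (B, T, K, d_angle) angle-branch feature map a(T, K, m)
        v = self.proj_main(f_out)                           # (B, T, N, d_common)
        a = self.proj_angle(f_angle)                        # (B, T, K, d_common)
        # f_cross: association between every main-stream node and every angle joint
        sim = torch.matmul(v, a.transpose(-1, -2))          # (B, T, N, K)
        f_cross = torch.matmul(F.softmax(sim, dim=-1), a)   # (B, T, N, d_common)
        # f_self: self-attention of the main-stream feature map
        self_sim = torch.matmul(v, v.transpose(-1, -2))     # (B, T, N, N)
        f_self = torch.matmul(F.softmax(self_sim, dim=-1), v)
        gate = torch.sigmoid(self.g(torch.cat([f_self, f_cross], dim=-1)))  # (B, T, N, 1)
        attention = gate * f_out                            # Attention = g(f_self, f_cross) * f_out
        return (1 + attention) * f_out                      # f_attention
```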
Step 6: the convolution details of the spatial domain and the time domain are described in detail by the steps 3 and 4, and a space-time diagram convolution module is a complete system. The input is performed first to keep a residual error connected to the module, and finally, the first operation is to perform space domain graph convolution, then batch normalization operation BatchNormalization, reLU to activate layers, dropout layers with coefficients of 0.5, then perform space domain graph convolution, and then connect batch normalization operations Batchnormalization and ReLU activating layers. And the two branches of the network are respectively composed of nine space-time convolution modules, a global pooling layer and a Softmax layer. Finally, the action category is obtained through a Softmax classifier.
Of course, before the graph convolution neural network of this embodiment is used to recognize human actions, the model is first trained. The training part uses the PyTorch framework and the cross entropy loss function, whose formula is:
Loss = -[y log y' + (1 - y) log(1 - y')]
where y is the label of the sample and y' is the prediction of our model. During training we set the batch size to 64, optimize with stochastic gradient descent (SGD) with a momentum of 0.9, set the weight decay to 0.0001, and train for a total of 30 epochs.
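A training sketch matching the stated setup (batch size 64, SGD with momentum 0.9, weight decay 0.0001, cross-entropy loss, 30 epochs) is given below; the learning rate and the dataset object are assumptions.
```python
import torch
from torch.utils.data import DataLoader

def train(model, train_set, epochs=30):
    """Training sketch for the described setup. `train_set` is assumed to
    yield (features, label) pairs with features shaped (C, T, N)."""
    loader = DataLoader(train_set, batch_size=64, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,   # learning rate not stated; assumed
                                momentum=0.9, weight_decay=0.0001)
    loss_fn = torch.nn.CrossEntropyLoss()                     # multi-class form of the stated loss
    model.train()
    for epoch in range(epochs):
        total = 0.0
        for feats, labels in loader:
            optimizer.zero_grad()
            logits = model(feats)
            loss = loss_fn(logits, labels)
            loss.backward()
            optimizer.step()
            total += loss.item()
        print(f"epoch {epoch + 1}: mean loss {total / len(loader):.4f}")
```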
The technical scheme of the invention is not limited to the specific embodiment, and all technical modifications made according to the technical scheme of the invention fall within the protection scope of the invention.

Claims (4)

1. The human action recognition method based on the graph convolution neural network is characterized by comprising the following steps of:
step 1: preparing human action video data, marking the video data, and marking the video labels according to different kinds of actions;
step 2: extracting skeleton key point features from the human action video data with the OpenPose pose estimation algorithm, calculating the change speed of the skeleton key points between adjacent frames in the skeleton-point main-stream network, and performing feature concatenation; screening the skeleton key points, calculating the included angles of the screened skeleton key points in the angle branch network, and performing feature concatenation;
step 2.1: firstly, cropping the videos so that the person in each video is located at the center of the frame;
step 2.2: using the OpenPose pose estimation algorithm to extract the human skeleton key points; for the video S, 15 equally spaced sampling points S = (T1, T2, T3, ..., T15) are taken and the skeleton key point data of each point are saved; 18 skeleton key points are extracted each time, representing 18 parts of the human body respectively; letting the length of a single video frame be L and its width be W, the extracted skeleton key point coordinates are normalized, with Tn denoting the skeleton key point data of the nth sampled frame, the normalized Tn being:
wherein x_n is the abscissa of the nth skeleton key point, y_n is the ordinate of the nth skeleton key point, and Tn is the normalized skeleton key point coordinates of the nth frame;
step 2.3: calculating the change speed of the key points between adjacent frames, the speed Vn being:
V_n = ((x_{1,n} - x_{1,n-1}, y_{1,n} - y_{1,n-1}), (x_{2,n} - x_{2,n-1}, y_{2,n} - y_{2,n-1}), ..., (x_{18,n} - x_{18,n-1}, y_{18,n} - y_{18,n-1}));
wherein the specific meanings of x and y are the same as in step 2.2; after the speed V is obtained, feature concatenation is performed, and the total concatenated feature Dn is:
D_n = Concat(T_n, T_n', V_n);
wherein T_n and T_n' respectively denote the normalized skeleton key point coordinates obtained from the side view and the front view at time n, and the Concat function denotes concatenation of the variables in brackets;
step 2.4: screening the skeleton key points extracted by OpenPose, keeping the left knee, right knee, left waist, right waist, left shoulder, right shoulder, left elbow and right elbow;
step 2.5: calculating an included angle:
(1) Knee:
(2) Waist:
(3) Shoulder:
(4) Elbow:
step 3: transmitting the concatenated data to a graph neural network;
step 4: extending the graph convolution from the spatial domain to the temporal domain;
step 5: using a cross-attention model to enhance the performance of the network;
step 5.1: cross attention enhances the expressive power of the main network flow through the feature map of the skeletal joint angle network branches, the formula of which is:
f attention =(1+Attention)*f out
step 5.2: the method for calculating the Attention comprises the following steps:
Attention=g(f self ,f cross )*f out
wherein f self Is the self-attention weight of the main network output characteristic diagram, f cross Adding the joint angles and the cross-attention weights of the primary network data to add weights to the primary network feature map, where g represents transforming both dimensions to f out Is added up, wherein f cross The method comprises the following steps:
wherein v (T, N, d) is a main network feature graph, N is the number of main network data skeleton nodes, and d represents the feature dimension of each node; a (T, k, m) is a joint angle network feature map, k represents the number of joints of the bone joint angle data, and m is the dimension thereof;
step 6: constructing a graph convolution neural network consisting of nine space-time convolution modules, a global pooling layer and a Softmax layer, wherein the global pooling layer summarizes the node features in the graph structure so as to lift the node-level features to graph-level features, and the Softmax layer then outputs the action number of the person in the human action video.
2. The method for identifying human actions based on the graph roll-up neural network according to claim 1, wherein the step 3 specifically comprises:
step 3.1: the default human skeleton structure identified by the openpost posture estimation algorithm is used as the basic connection of the graphic neural network, and the adjacency matrix of the graphic neural network structure is set as A k The adjacency matrix representing the k-th network is an N x N two-dimensional matrix, where N is equal to 18, representing 18 skeletal key points; the A (n 1, n 2) position represents the connection state of the n1 and n2 positions, a value of 1 represents connection and a value of 0 represents disconnection;
step 3.2: setting the adjacency matrix of the graph neural network structure as B k It represents the action structure adjacency matrix of the k-th layer; the matrix is also an N x N two-dimensional matrix having the same meaning as a except that the matrix has no fixed value, each element of which is a trainable parameter;
step 3.3: setting the adjacency matrix of the graph neural network structure as C k The format is consistent with A and B, C k (n1,n2):
The process is a normalized Gaussian embedding method to calculate the similarity between any two bone key points, θ andrespectively two embedding methods, wherein T represents matrix transposition, so that the final output dimension is unchanged; the Softmax method changes the final value to 0 and 1, indicating whether or not connected; the output formula of the final graph neural network is:
wherein f in And f out Respectively representing the input and the output of the layer network, K represents the total layer number of the graph neural network, and W represents the convolution parameter.
3. The method for identifying human actions based on the graph roll-up neural network according to claim 2, wherein the step 4 specifically comprises:
for a point n_ij, define i to represent the ith frame and j to represent the jth skeleton key point; each temporal convolution involves only the same skeleton key point across frames, giving the formula:
wherein W is a convolution parameter and the other symbol denotes the output of the nth layer.
4. A method for identifying human actions based on a graph convolutional neural network according to claim 3, wherein said step 6 specifically comprises:
step 6.1: the input is firstly carried out, a residual error is reserved and connected to the module, the first operation is carrying out space domain graph convolution, then batch normalization operation BatchNormalization, reLU is carried out to activate a layer, a dropout layer with a coefficient of 0.5 is carried out, then space graph convolution is carried out, then batch normalization operation Batchnormalization and ReLU activating layers are connected, and the network overall structure is composed of nine space-time convolution modules, a global pooling layer and a Softmax layer;
step 6.2: the global pooling layer in the network is used for summarizing node characteristics in the graph structure, so that the node-level characteristics are updated to the graph-level characteristics, and then the action numbers of people in the video are output through the Softmax layer.
CN202011600579.7A 2020-12-29 2020-12-29 Human action recognition method based on graph convolution neural network Active CN112633209B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011600579.7A CN112633209B (en) 2020-12-29 2020-12-29 Human action recognition method based on graph convolution neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011600579.7A CN112633209B (en) 2020-12-29 2020-12-29 Human action recognition method based on graph convolution neural network

Publications (2)

Publication Number Publication Date
CN112633209A CN112633209A (en) 2021-04-09
CN112633209B (en) 2024-04-09

Family

ID=75286366

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011600579.7A Active CN112633209B (en) 2020-12-29 2020-12-29 Human action recognition method based on graph convolution neural network

Country Status (1)

Country Link
CN (1) CN112633209B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255514B (en) * 2021-05-24 2023-04-07 西安理工大学 Behavior identification method based on local scene perception graph convolutional network
CN113378656B (en) * 2021-05-24 2023-07-25 南京信息工程大学 Action recognition method and device based on self-adaptive graph convolution neural network
CN113361352A (en) * 2021-05-27 2021-09-07 天津大学 Student classroom behavior analysis monitoring method and system based on behavior recognition
CN113392743B (en) * 2021-06-04 2023-04-07 北京格灵深瞳信息技术股份有限公司 Abnormal action detection method, abnormal action detection device, electronic equipment and computer storage medium
CN114613011A (en) * 2022-03-17 2022-06-10 东华大学 Human body 3D (three-dimensional) bone behavior identification method based on graph attention convolutional neural network
CN114998990B (en) * 2022-05-26 2023-07-25 深圳市科荣软件股份有限公司 Method and device for identifying safety behaviors of personnel on construction site
CN115050101B (en) * 2022-07-18 2024-03-22 四川大学 Gait recognition method based on fusion of skeleton and contour features


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110532960A (en) * 2019-08-30 2019-12-03 西安交通大学 A kind of action identification method of the target auxiliary based on figure neural network
CN110705463A (en) * 2019-09-29 2020-01-17 山东大学 Video human behavior recognition method and system based on multi-mode double-flow 3D network
CN111709321A (en) * 2020-05-28 2020-09-25 西安交通大学 Human behavior recognition method based on graph convolution neural network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Action Recognition Based on Spatial Temporal Graph Convolutional Networks; Wanqiang Zheng et al.; Proceedings of the 3rd International Conference on Computer Science and Application Engineering, October 2019; 1-5 *
Skeleton-Based Action Recognition With Directed Graph Neural Networks; Lei Shi et al.; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 7912-7921 *
Research on machine-vision-based motion and posture analysis ***; Chen Yongkang; China Master's Theses Full-text Database, Social Sciences II (No. 2); H134-354 *
A survey of deep-learning-based action recognition algorithms; He Lei, Shao Zhanpeng, Zhang Jianhua, Zhou Xiaolong; Computer Science (No. S1); 149-157 *

Also Published As

Publication number Publication date
CN112633209A (en) 2021-04-09

Similar Documents

Publication Publication Date Title
CN112633209B (en) Human action recognition method based on graph convolution neural network
CN109685115B (en) Fine-grained conceptual model with bilinear feature fusion and learning method
CN108491880B (en) Object classification and pose estimation method based on neural network
KR102450374B1 (en) Method and device to train and recognize data
Liu et al. Multi-objective convolutional learning for face labeling
CN111291809B (en) Processing device, method and storage medium
Chen et al. A UAV-based forest fire detection algorithm using convolutional neural network
CN110222718B (en) Image processing method and device
CN105631398A (en) Method and apparatus for recognizing object, and method and apparatus for training recognizer
CN108961253A (en) A kind of image partition method and device
CN111582095B (en) Light-weight rapid detection method for abnormal behaviors of pedestrians
CN109919085B (en) Human-human interaction behavior identification method based on light-weight convolutional neural network
WO2021073311A1 (en) Image recognition method and apparatus, computer-readable storage medium and chip
Xia et al. Face occlusion detection based on multi-task convolution neural network
CN108154156B (en) Image set classification method and device based on neural topic model
CN112801015A (en) Multi-mode face recognition method based on attention mechanism
CN110765960B (en) Pedestrian re-identification method for adaptive multi-task deep learning
CN110633624A (en) Machine vision human body abnormal behavior identification method based on multi-feature fusion
CN111400572A (en) Content safety monitoring system and method for realizing image feature recognition based on convolutional neural network
CN110598746A (en) Adaptive scene classification method based on ODE solver
CN114463837A (en) Human behavior recognition method and system based on self-adaptive space-time convolution network
Liu Human face expression recognition based on deep learning-deep convolutional neural network
US20080232682A1 (en) System and method for identifying patterns
CN114170659A (en) Facial emotion recognition method based on attention mechanism
CN117611838A (en) Multi-label image classification method based on self-adaptive hypergraph convolutional network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant