CN112633209B - Human action recognition method based on graph convolution neural network - Google Patents
- Publication number
- CN112633209B (application CN202011600579.7A)
- Authority
- CN
- China
- Prior art keywords
- graph
- network
- neural network
- video
- human
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06V40/23—Recognition of whole body movements, e.g. for sport training
- G06F18/2415—Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
- G06V20/42—Higher-level, semantic clustering, classification or understanding of video scenes of sport video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Abstract
The invention discloses a human action recognition method based on a graph convolution neural network, which comprises the steps of: preparing and labeling human action video data, marking video labels according to the different kinds of actions; extracting bone keypoint features from the human action video data using the OpenPose pose estimation algorithm, calculating the change speed of the bone keypoints between adjacent frames through a skeleton-point main-stream network, and performing feature splicing; screening the bone keypoints, calculating the included angles of the screened bone keypoints through an angle branch network, and performing feature splicing; transmitting the spliced data to a graph neural network; extending the graph convolution from the spatial domain to the temporal domain; using a cross-attention model to enhance the performance of the network; and performing human action recognition. The invention can recognize and output the actions performed by humans in an input video, has good usability and robustness, and lays a certain foundation for the practical deployment of artificial intelligence technology in the field of action recognition.
Description
Technical Field
The invention relates to the technical field of computer vision, in particular to a human action recognition method based on a graph convolution neural network.
Background
Artificial intelligence technology has been spreading into various industries, and action recognition is a key technology behind a number of popular applications and demands; it has become one of the most closely watched directions in the field of computer vision. Examples include the detection of, and alarms for, abnormal human behavior in intelligent surveillance cameras, and the classification and retrieval of human behavior in video; high-image-quality games use motion capture technology to put the movements of professional actors into the game and give players a sense of immersion. It is expected that there will be more and more applications of action recognition technology in the future.
The computer vision field currently applies two main families of techniques to human action recognition: one is video-based methods using RGB and optical flow, and the other is methods based on human bone keypoints. The RGB and optical-flow methods can learn the task end to end, but extracting optical flow from video is a very heavy task; although various methods now exist to reduce this cost, optical flow has always remained a powerful feature for action recognition. Methods based on human bone keypoints are a newer approach to action recognition that became practical as pose estimation technology matured; compared with the traditional RGB and optical-flow methods, they can model human behavior more effectively, because the traditional methods cannot avoid the influence of background and lighting changes. On the other hand, they require feature extraction from the video using a pose estimation algorithm, which is one step more than the traditional methods. In addition, existing skeleton-based recognition methods simply use the raw bone keypoint data, whereas the information describing an action includes not only coordinates but also angles and their change speeds, which are likewise important elements of action feature description.
Thus, given the current state of the art and the complexity of actions themselves, there is a need for a human action recognition method with a deep-learning theoretical basis and richer descriptive elements for this task.
Disclosure of Invention
In view of the current state of the field and the complexity of actions themselves, the invention aims to provide a human action recognition method based on a graph convolution neural network.
In order to achieve the above purpose, the invention is implemented according to the following technical scheme:
a human action recognition method based on a graph convolution neural network comprises the following steps:
step 1: preparing human action video data, marking the video data, and marking the video labels according to different kinds of actions;
step 2: extracting bone keypoint features from the human action video data by using the OpenPose pose estimation algorithm, calculating the change speed of the bone keypoints between adjacent frames through the skeleton-point main-stream network, and performing feature splicing; screening the bone keypoints, calculating the included angles of the screened bone keypoints through the angle branch network, and performing feature splicing;
step 3: transmitting the spliced data to a graph neural network;
step 4: extending the graph convolution from the spatial domain to the temporal domain;
step 5: using a cross-attention model to enhance the performance of the network;
step 6: and constructing a graph convolution neural network consisting of nine space-time convolution modules, a global pooling layer and a Softmax layer, wherein the global pooling layer is used for summarizing node characteristics in a graph structure so as to upgrade the node-level characteristics into the graph-level characteristics, and then outputting the action numbers of people in a human action video through the Softmax layer.
Further, the step 2 specifically includes:
step 2.1: firstly, cutting videos to ensure that human beings in each video are positioned in the center of the video;
step 2.2: using the OpenPose pose estimation algorithm to extract human bone keypoints, take 15 equally spaced sampling points S = (T1, T2, T3, ..., T15) from the video and save the bone keypoint data at each point; 18 bone keypoints are extracted each time, respectively representing 18 parts of the human body; setting the length of a single video frame as L and the video width as W, normalize the extracted bone keypoint coordinates, using Tn to represent the bone keypoint data of the nth frame, where the normalized Tn is:
Tn = ((x1n/L, y1n/W), (x2n/L, y2n/W), ..., (x18n/L, y18n/W));
wherein xn is the abscissa of the nth bone keypoint, yn is the ordinate of the nth bone keypoint, and Tn is the normalized bone keypoint coordinates of the nth frame;
step 2.3: calculating the change speed of the keypoints between adjacent frames, the speed Vn being:
Vn = ((x1n - x1n-1, y1n - y1n-1), (x2n - x2n-1, y2n - y2n-1), ..., (x18n - x18n-1, y18n - y18n-1));
wherein the specific meaning of x and y is the same as in step 2.2; after the speed V is obtained, feature splicing is performed, and the total spliced feature Dn is:
Dn = Concat(Tn, Tn', Vn);
wherein Tn and Tn' respectively represent the normalized bone keypoint coordinates obtained from the side view and the front view at time n, and the Concat function represents splicing of the variables in brackets;
step 2.4: screening the bone keypoints extracted by OpenPose, keeping the left knee, right knee, left waist, right waist, left shoulder, right shoulder, left elbow and right elbow;
step 2.5: calculating the included angle at each retained joint:
(1) Knee:
(2) Waist:
(3) Shoulder:
(4) Elbow:
further, the step 3 specifically includes:
step 3.1: the default human skeleton structure identified by the OpenPose pose estimation algorithm is used as the basic connection of the graph neural network, and the adjacency matrix of the graph neural network structure is set as Ak, the adjacency matrix of the k-th layer, an N x N two-dimensional matrix where N equals 18, representing the 18 bone keypoints; the A(n1, n2) position represents the connection state between positions n1 and n2, a value of 1 representing connected and a value of 0 representing disconnected;
step 3.2: setting the adjacency matrix of the graph neural network structure as B k It represents the action structure adjacency matrix of the k-th layer; the matrix is also an N x N two-dimensional matrix having the same meaning as a except that the matrix has no fixed value, each element of which is a trainable parameter;
step 3.3: setting the adjacency matrix of the graph neural network structure as Ck, whose format is consistent with A and B, with Ck(n1, n2):
Ck(n1, n2) = Softmax(θ(fin(n1))^T · φ(fin(n2)));
this is a normalized Gaussian embedding that calculates the similarity between any two bone keypoints, θ and φ being two embedding functions, and T denoting matrix transposition, so that the final output dimension is unchanged; the Softmax operation maps the final values towards 0 and 1, indicating whether two keypoints are connected; the output formula of the final graph neural network is:
fout = Σ (k = 1..K) Wk · fin · (Ak + Bk + Ck);
wherein fin and fout respectively represent the input and output of the layer, K represents the total number of layers of the graph neural network, and W represents the convolution parameters.
Further, the step 4 specifically includes:
for a point nij, define i to represent the ith frame and j the jth bone keypoint; each time-domain convolution involves only the same bone keypoint, giving the formula:
fout(nij) = Σt wt · fin(n(i+t)j);
wherein w is the convolution parameter and fout(nij) is the output of the layer at point nij.
Further, the step 5 specifically includes:
step 5.1: cross attention enhances the expressive power of the main network flow through the feature map of the skeletal joint angle network branches, the formula of which is:
fattention = (1 + Attention) * fout;
step 5.2: the method for calculating the Attention comprises the following steps:
Attention = g(fself, fcross) * fout
wherein fself is the self-attention weight of the main-network output feature map and fcross is the cross-attention weight between the joint angles and the main-network data, the two being added to weight the main-network feature map; g represents transforming both to the dimension of fout and summing them; fcross is computed as follows:
wherein v(T, N, d) is the main-network feature map, N is the number of main-network skeleton nodes, and d represents the feature dimension of each node; a(T, k, m) is the joint-angle network feature map, k represents the number of joints in the bone joint angle data, and m is its dimension.
Further, the step 6 specifically includes:
step 6.1: the input first branches into a residual connection around the module; inside the module, the first operation is a spatial-domain graph convolution, followed by a batch normalization operation (BatchNormalization), a ReLU activation layer and a dropout layer with coefficient 0.5, then a time-domain graph convolution, again followed by batch normalization and ReLU activation layers; the overall network structure consists of nine spatio-temporal convolution modules, a global pooling layer and a Softmax layer.
Step 6.2: the global pooling layer in the network is used for summarizing node characteristics in the graph structure, so that the node-level characteristics are updated to the graph-level characteristics, and then the action numbers of people in the video are output through the Softmax layer.
Compared with the prior art, the human action recognition method based on the graph convolution neural network can recognize and output the actions performed by humans in an input video, has good usability and robustness, and lays a foundation for the practical deployment of artificial intelligence technology in the field of action recognition.
Drawings
Fig. 1 is a flowchart of a human motion recognition method based on a graph convolutional neural network according to the present invention.
Fig. 2 is a cross-attention network architecture.
Detailed Description
The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. The specific embodiments described herein are for purposes of illustration only and are not intended to limit the invention.
As shown in fig. 1 and 2, the present embodiment provides a human action recognition method based on a graph convolution neural network, which includes the following steps:
step 1: preparing human action video data and marking the video labels according to the different kinds of actions, with the label numbering starting from 0;
step 2.1: performing feature extraction and feature design on the basic data to serve as motion information features;
step 2.1.1: firstly, cutting videos to ensure that human beings in each video are positioned in the center of the video;
step 2.1.2: using the OpenPose pose estimation algorithm to extract human bone keypoints, we take 15 equally spaced sampling points S = (T1, T2, T3, ..., T15) from the video and save the bone keypoint data at each point. 18 bone keypoints are extracted each time, respectively representing 18 parts of the human body. Setting the length of a single video frame as L and the video width as W, the extracted bone keypoint coordinates are normalized; using Tn to represent the bone keypoint data of the nth frame, the normalized Tn is:
Tn = ((x1n/L, y1n/W), (x2n/L, y2n/W), ..., (x18n/L, y18n/W));
wherein xn is the abscissa of the nth bone keypoint, yn is the ordinate of the nth bone keypoint, and Tn is the normalized bone keypoint coordinates of the nth frame.
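The normalization above can be sketched as follows; the division of each abscissa by L and each ordinate by W follows the text, while the `normalize_keypoints` name, the (18, 2) array layout, and the sample frame dimensions are assumptions for illustration.

```python
import numpy as np

L, W = 1920.0, 1080.0  # hypothetical single-frame length and width

def normalize_keypoints(frame_kpts: np.ndarray) -> np.ndarray:
    """Normalize 18 bone keypoints: x_n / L and y_n / W, as in Tn."""
    out = frame_kpts.astype(float).copy()
    out[:, 0] /= L  # abscissa of each bone keypoint
    out[:, 1] /= W  # ordinate of each bone keypoint
    return out

# one frame of 18 keypoints, all at the frame midpoint
T_n = normalize_keypoints(np.array([[960.0, 540.0]] * 18))
```

After this step every coordinate lies in [0, 1] regardless of the original video resolution.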
step 2.1.3: calculating the change speed of the keypoints between adjacent frames, the speed Vn being:
Vn = ((x1n - x1n-1, y1n - y1n-1), (x2n - x2n-1, y2n - y2n-1), ..., (x18n - x18n-1, y18n - y18n-1));
wherein x and y have the same meaning as in step 2.1.2. After the speed V is obtained, feature splicing is performed, and the total spliced feature Dn is:
Dn = Concat(Tn, T'n, Vn)
wherein Tn and T'n represent the normalized bone keypoint coordinates obtained from the side view and the front view at time n, and the Concat function represents splicing of the variables in brackets.
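A minimal sketch of the velocity and feature-splicing steps above, assuming numpy arrays of shape (18, 2) for the normalized keypoints; the variable names and random sample data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T_prev = rng.random((18, 2))  # normalized keypoints, frame n-1
T_n    = rng.random((18, 2))  # normalized keypoints, frame n
T_side = rng.random((18, 2))  # normalized keypoints at time n, side view

# Vn: per-keypoint coordinate difference between adjacent frames
V_n = T_n - T_prev

# Dn = Concat(Tn, Tn', Vn): splice along the feature axis, 6 values per joint
D_n = np.concatenate([T_n, T_side, V_n], axis=1)
```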
Step 2.2: the bone point data is further refined to be used as high-order information, and the data in the step 2.1 form a double-flow network to complement each other;
step 2.2.1: because joint angles are critical to the action category, the human bone keypoints extracted by OpenPose are screened, keeping the left knee, right knee, left waist, right waist, left shoulder, right shoulder, left elbow and right elbow;
step 2.2.2: calculating an included angle:
(1) Knee:
(2) Waist:
(3) Shoulder:
(4) Elbow:
step 3: transmitting the spliced data into a graph neural network, the graph neural network structure mainly comprising three parts;
step 3.1: the first part adopts the default human skeleton structure identified by the OpenPose pose estimation algorithm as the basic connection of the graph neural network; this basic structure adapts to the basic motion forms of human beings and has a certain modeling capability for motion of any form; the adjacency matrix of the graph structure is set as Ak, representing the k-th layer, an N x N two-dimensional matrix where N equals 18, representing the 18 bone keypoints. The A(n1, n2) position represents the connection state between positions n1 and n2, a value of 1 representing connected and a value of 0 representing disconnected;
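The fixed adjacency matrix A of step 3.1 can be sketched as follows. The patent only states that the default OpenPose skeleton connections are used, so the concrete edge list here is an assumption based on the commonly used OpenPose 18-keypoint (COCO) layout.

```python
import numpy as np

# Assumed OpenPose COCO-18 skeleton edges (keypoint index pairs)
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7),
         (1, 8), (8, 9), (9, 10), (1, 11), (11, 12), (12, 13),
         (0, 14), (14, 16), (0, 15), (15, 17)]

A = np.zeros((18, 18))
for n1, n2 in EDGES:
    A[n1, n2] = A[n2, n1] = 1  # 1 = connected, 0 = disconnected
```

Because connection is an undirected relation between keypoints, the matrix is built symmetrically.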
step 3.2: second part to compensate the fitting ability of the basic structure to the motion diversity, we set the adjacency matrix of the structure as B k It represents the structural adjacency matrix of the k-th layer. The matrix is also an N multiplied by N two-dimensional matrix, the meaning is the same as that of A, except that the matrix has no fixed value, each element of the matrix is a trainable parameter, and the training stage automatically learns which connection modes have better compensation effect on actions;
step 3.3: the third part is a data-driven graph structure, which takes different values for each different action; we set this matrix as Ck, with format consistent with A and B, and Ck(n1, n2):
Ck(n1, n2) = Softmax(θ(fin(n1))^T · φ(fin(n2)));
this is a normalized Gaussian embedding that calculates the similarity between any two bone keypoints, θ and φ being two embedding functions, and T denoting matrix transposition, so that the final output dimension is unchanged. The Softmax operation maps the final values towards 0 and 1, indicating whether two keypoints are connected. The output of the final graph neural network is:
fout = Σ (k = 1..K) Wk · fin · (Ak + Bk + Ck);
wherein fin and fout respectively represent the input and output of the layer, K represents the total number of layers of the graph neural network, the matrices A, B and C are as described in the steps above, and W represents the convolution parameters;
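The layer output fout = Σk Wk fin (Ak + Bk + Ck) can be sketched numerically as below. The sum index k and the roles of A, B and C follow the formulas above; the feature sizes, the value of K, and all random initializations are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d_in, d_out, K = 18, 6, 16, 3   # keypoints, feature dims, number of terms

f_in = rng.random((d_in, N))        # per-keypoint input features
f_out = np.zeros((d_out, N))
for k in range(K):
    A_k = rng.integers(0, 2, (N, N)).astype(float)  # fixed skeleton links
    B_k = rng.random((N, N))                        # trainable compensation graph
    # C_k: normalized Gaussian embedding, row-wise Softmax over similarities
    theta, phi = rng.random((8, d_in)), rng.random((8, d_in))
    S = (theta @ f_in).T @ (phi @ f_in)             # N x N pairwise similarity
    C_k = np.exp(S - S.max(axis=1, keepdims=True))
    C_k /= C_k.sum(axis=1, keepdims=True)
    W_k = rng.random((d_out, d_in))                 # convolution parameters
    f_out += W_k @ f_in @ (A_k + B_k + C_k)
```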
step 4: extending the graph convolution from the spatial domain to the time domain; for a point nij we define i to represent the ith frame and j the jth bone keypoint, and each time-domain convolution involves only the same bone keypoint, giving the formula:
fout(nij) = Σt wt · fin(n(i+t)j);
wherein w is the convolution parameter, fout(nij) is the output of the layer at point nij, and the other variables are defined as before.
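The time-domain convolution over the same bone keypoint can be sketched as follows; the kernel size, number of frames, and random data are assumptions, and only one feature channel per joint is shown for brevity.

```python
import numpy as np

rng = np.random.default_rng(2)
T, J, tau = 15, 18, 3        # frames, joints, temporal kernel size
x = rng.random((T, J))       # one feature value per joint per frame
w = rng.random(tau)          # shared temporal kernel

# Each output mixes only the same bone keypoint across tau adjacent frames
y = np.zeros((T - tau + 1, J))
for j in range(J):
    y[:, j] = np.convolve(x[:, j], w[::-1], mode="valid")
```

Joints are never mixed here; spatial mixing is handled entirely by the graph convolution of step 3.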
Step 5: the performance of the network is enhanced by using a cross-attention model, which is constructed as shown in fig. 2:
step 5.1: cross attention enhances the expressive power of the main network flow through the feature map of the skeletal joint angle network branches, the formula of which is:
fattention = (1 + Attention) * fout
the self-attention model is a residual attention model because as the number of network layers deepens, simple attention stacking will cause some features to disappear.
Step 5.2: the method for calculating the Attention comprises the following steps:
Attention = g(fself, fcross) * fout
wherein fself is the self-attention weight of the main-network output feature map and fcross is the cross-attention weight between the joint angles and the main-network data; the two are added to weight the main-network feature map, and g represents transforming both to the dimension of fout and summing them. fcross is computed as follows:
wherein v(T, N, d) is the main-network feature map, N is the number of main-network skeleton nodes, and d represents the feature dimension of each node; a(T, k, m) is the joint-angle network feature map, k represents the number of joints in the bone joint angle data, and m is its dimension. The formula computes the associations between the nodes of the two networks and uses them as cross attention.
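The cross-attention enhancement fattention = (1 + Attention) * fout can be sketched as below. The text does not spell out g or the node-association formula, so the projection `P` and the mean-pooling used for g are assumptions; shapes follow the v(T, N, d) and a(T, k, m) definitions above.

```python
import numpy as np

rng = np.random.default_rng(3)
T, N, d = 15, 18, 16     # frames, skeleton nodes, node feature dim
k, m = 8, 4              # joint-angle branch: joints and feature dim

v = rng.random((T, N, d))       # main-network feature map
a = rng.random((T, k, m))       # joint-angle branch feature map
f_out = rng.random((T, N, d))   # main-network output

P = rng.random((m, d))          # assumed projection into node-feature space
f_cross = np.einsum("tnd,tkd->tnk", v, a @ P)  # node/joint associations
f_self = np.einsum("tnd,tmd->tnm", v, v)       # node/node self-attention
# g: pool both to the shape of f_out (assumption) and sum
g = f_self.mean(axis=2, keepdims=True) + f_cross.mean(axis=2, keepdims=True)
attention = g * f_out
f_attention = (1 + attention) * f_out          # residual attention
```

The (1 + Attention) form keeps the original fout intact even when the attention term is small, which is what prevents features from vanishing as layers stack.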
Step 6: the convolution details of the spatial domain and the time domain are described in detail by the steps 3 and 4, and a space-time diagram convolution module is a complete system. The input is performed first to keep a residual error connected to the module, and finally, the first operation is to perform space domain graph convolution, then batch normalization operation BatchNormalization, reLU to activate layers, dropout layers with coefficients of 0.5, then perform space domain graph convolution, and then connect batch normalization operations Batchnormalization and ReLU activating layers. And the two branches of the network are respectively composed of nine space-time convolution modules, a global pooling layer and a Softmax layer. Finally, the action category is obtained through a Softmax classifier.
Of course, before the graph convolutional neural network of this embodiment is used to identify human actions, the model is first trained. The training part uses the PyTorch framework with a cross-entropy loss function, whose formula is:
Loss = -[y·log y' + (1 - y)·log(1 - y')]
where y is the label of the sample and y' is our model's prediction. During training we set the batch size to 64, optimize using SGD (stochastic gradient descent) with momentum 0.9 and weight decay 0.0001, and train for a total of 30 epochs.
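The cross-entropy formula above can be checked numerically; the sample label and prediction values are illustrative.

```python
import numpy as np

def bce(y: float, y_pred: float) -> float:
    """Loss = -[y*log(y') + (1-y)*log(1-y')] for a single sample."""
    return float(-(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred)))

loss_good = bce(1.0, 0.9)  # confident, correct prediction -> small loss
loss_bad = bce(1.0, 0.1)   # confident, wrong prediction  -> large loss
```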
The technical scheme of the invention is not limited to the specific embodiment, and all technical modifications made according to the technical scheme of the invention fall within the protection scope of the invention.
Claims (4)
1. The human action recognition method based on the graph convolution neural network is characterized by comprising the following steps of:
step 1: preparing human action video data, marking the video data, and marking the video labels according to different kinds of actions;
step 2: extracting bone keypoint features from the human action video data by using the OpenPose pose estimation algorithm, calculating the change speed of the bone keypoints between adjacent frames through the skeleton-point main-stream network, and performing feature splicing; screening the bone keypoints, calculating the included angles of the screened bone keypoints through the angle branch network, and performing feature splicing;
step 2.1: firstly, cutting videos to ensure that human beings in each video are positioned in the center of the video;
step 2.2: using the OpenPose pose estimation algorithm to extract human bone keypoints, take 15 equally spaced sampling points S = (T1, T2, T3, ..., T15) from the video and save the bone keypoint data at each point; 18 bone keypoints are extracted each time, respectively representing 18 parts of the human body; setting the length of a single video frame as L and the video width as W, normalize the extracted bone keypoint coordinates, using Tn to represent the bone keypoint data of the nth frame, where the normalized Tn is:
Tn = ((x1n/L, y1n/W), (x2n/L, y2n/W), ..., (x18n/L, y18n/W));
wherein xn is the abscissa of the nth bone keypoint, yn is the ordinate of the nth bone keypoint, and Tn is the normalized bone keypoint coordinates of the nth frame;
step 2.3: calculating the change speed of the keypoints between adjacent frames, the speed Vn being:
Vn = ((x1n - x1n-1, y1n - y1n-1), (x2n - x2n-1, y2n - y2n-1), ..., (x18n - x18n-1, y18n - y18n-1));
wherein the specific meaning of x and y is the same as in step 2.2; after the speed V is obtained, feature splicing is performed, and the total spliced feature Dn is:
Dn = Concat(Tn, T'n, Vn);
wherein Tn and T'n respectively represent the normalized bone keypoint coordinates obtained from the side view and the front view at time n, and the Concat function represents splicing of the variables in brackets;
step 2.4: screening the bone keypoints extracted by OpenPose, keeping the left knee, right knee, left waist, right waist, left shoulder, right shoulder, left elbow and right elbow;
step 2.5: calculating an included angle:
(1) Knee:
(2) Waist:
(3) Shoulder:
(4) Elbow:
step 3: transmitting the spliced data to a graph neural network;
step 4: extending the graph convolution from the spatial domain to the temporal domain;
step 5: using a cross-attention model to enhance the performance of the network;
step 5.1: cross attention enhances the expressive power of the main network flow through the feature map of the skeletal joint angle network branches, the formula of which is:
fattention = (1 + Attention) * fout;
step 5.2: the method for calculating the Attention comprises the following steps:
Attention = g(fself, fcross) * fout
wherein fself is the self-attention weight of the main-network output feature map and fcross is the cross-attention weight between the joint angles and the main-network data, the two being added to weight the main-network feature map; g represents transforming both to the dimension of fout and summing them; fcross is computed as follows:
wherein v(T, N, d) is the main-network feature map, N is the number of main-network skeleton nodes, and d represents the feature dimension of each node; a(T, k, m) is the joint-angle network feature map, k represents the number of joints in the bone joint angle data, and m is its dimension;
step 6: and constructing a graph convolution neural network consisting of nine space-time convolution modules, a global pooling layer and a Softmax layer, wherein the global pooling layer is used for summarizing node characteristics in a graph structure so as to upgrade the node-level characteristics into the graph-level characteristics, and then outputting the action numbers of people in a human action video through the Softmax layer.
2. The human action recognition method based on the graph convolution neural network according to claim 1, wherein the step 3 specifically comprises:
step 3.1: the default human skeleton structure identified by the OpenPose pose-estimation algorithm is used as the basic connectivity of the graph neural network; the adjacency matrix of the graph structure is denoted A_k, the adjacency matrix of the k-th layer, an N × N two-dimensional matrix where N equals 18, representing the 18 skeleton key points; the entry A(n1, n2) represents the connection state between key points n1 and n2, where a value of 1 denotes connected and a value of 0 denotes not connected;
step 3.2: a second adjacency matrix of the graph structure is denoted B_k, the action-structure adjacency matrix of the k-th layer; it is likewise an N × N two-dimensional matrix with the same meaning as A, except that it has no fixed values: every element is a trainable parameter;
step 3.3: a third adjacency matrix of the graph structure is denoted C_k, with the same format as A and B; C_k(n1, n2) is obtained by a normalized Gaussian embedding that computes the similarity between any two bone key points, C_k(n1, n2) = softmax(θ(f_in)^T φ(f_in)), where θ and φ are the two embedding functions and T denotes matrix transposition, so the final output dimension is unchanged; the Softmax operation drives the result towards 0 or 1, indicating whether the two key points are connected; the output formula of the final graph neural network is:
f_out = Σ_{k=1}^{K} W_k f_in (A_k + B_k + C_k)
wherein f_in and f_out denote the input and output of this network layer respectively, K denotes the total number of layers of the graph neural network, and W denotes the convolution parameter.
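A single layer of claim 2 can be sketched as below. The patent's embedding functions are shown here as plain linear maps `theta`/`phi` (an assumption; in practice they would be 1×1 convolutions), and only one k-term of the summed output formula is shown:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_graph_layer(f_in, A, B, theta, phi, W):
    """One k-term of f_out = sum_k W_k f_in (A_k + B_k + C_k), as a sketch.

    f_in:       (T, N, d) node features over T frames and N = 18 key points
    A:          fixed skeleton adjacency (N, N), entries 0/1 per claim 2
    B:          trainable action-structure adjacency (N, N)
    theta, phi: (d, de) embedding matrices standing in for the two embeddings
    W:          (d, d_out) convolution parameter
    """
    # C: normalized Gaussian embedding -- softmax of embedded pairwise similarity
    e1, e2 = f_in @ theta, f_in @ phi                  # (T, N, de) each
    C = softmax(e1 @ e2.transpose(0, 2, 1), axis=-1)   # (T, N, N), rows sum to 1
    # aggregate each node's neighbours under (A + B + C), then project with W
    return ((A + B + C) @ f_in) @ W                    # (T, N, d_out)
```

Broadcasting lets the fixed (N, N) matrices A and B combine with the per-frame (T, N, N) matrix C without copying.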
3. The method for recognizing human actions based on the graph convolutional neural network according to claim 2, wherein step 4 specifically comprises:
for a point n_ij, define i as the i-th frame and j as the j-th bone key point; each time-domain convolution involves only the same bone key point, according to the formula:
wherein w is the convolution parameter and the other term denotes the output of the n-th layer.
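The constraint in claim 3, with the formula itself given as an image in the original, amounts to a 1-D convolution along frames applied independently per key point; a minimal sketch (function name mine) might be:

```python
import numpy as np

def temporal_conv(f, w):
    """Time-domain convolution per claim 3: for a point n_ij (frame i, bone
    key point j), the kernel w slides over frames of the SAME key point j
    only -- there is no mixing across different key points.

    f: (T, N) one feature value per frame per key point; w: (kt,) kernel.
    Returns (T - kt + 1, N) valid-mode outputs.
    """
    T, N = f.shape
    kt = len(w)
    out = np.zeros((T - kt + 1, N))
    for j in range(N):                       # each key point convolved independently
        for i in range(T - kt + 1):
            out[i, j] = np.dot(f[i:i + kt, j], w)
    return out
```

In a real network this would be a strided 2-D convolution with kernel (kt, 1) over the (frame, key point) axes; the loop form is only to make the per-key-point constraint explicit.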
4. A method for identifying human actions based on a graph convolutional neural network according to claim 3, wherein said step 6 specifically comprises:
step 6.1: the input first enters the module with a residual connection preserved; the first operation is a spatial-domain graph convolution, followed by a batch normalization operation (BatchNormalization), a ReLU activation layer and a dropout layer with coefficient 0.5; a temporal convolution is then carried out, again followed by a batch normalization operation and a ReLU activation layer; the overall network structure consists of nine spatio-temporal convolution modules, a global pooling layer and a Softmax layer;
step 6.2: the global pooling layer in the network summarizes the node features in the graph structure, promoting node-level features to graph-level features; the Softmax layer then outputs the action number of the person in the video.
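The pooling-and-classification head of step 6.2 can be sketched as follows; the classifier weight `W_cls` and the use of global *average* pooling are assumptions, since the patent specifies only "global pooling" followed by Softmax:

```python
import numpy as np

def pool_and_classify(features, W_cls):
    """Sketch of step 6.2: global pooling lifts node-level features (T, N, d)
    to one graph-level vector of length d, then a Softmax layer over assumed
    classifier weights W_cls (d, num_actions) yields action probabilities.
    """
    g = features.mean(axis=(0, 1))            # global pooling over frames and nodes
    logits = g @ W_cls                        # one score per action class
    e = np.exp(logits - logits.max())         # numerically stable Softmax
    probs = e / e.sum()
    return int(probs.argmax()), probs         # predicted action number, distribution
```

Averaging over both the frame and node axes is what makes the output independent of video length and key-point count, so one classifier serves all inputs.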
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011600579.7A CN112633209B (en) | 2020-12-29 | 2020-12-29 | Human action recognition method based on graph convolution neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112633209A CN112633209A (en) | 2021-04-09 |
CN112633209B true CN112633209B (en) | 2024-04-09 |
Family
ID=75286366
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011600579.7A Active CN112633209B (en) | 2020-12-29 | 2020-12-29 | Human action recognition method based on graph convolution neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112633209B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113255514B (en) * | 2021-05-24 | 2023-04-07 | 西安理工大学 | Behavior identification method based on local scene perception graph convolutional network |
CN113378656B (en) * | 2021-05-24 | 2023-07-25 | 南京信息工程大学 | Action recognition method and device based on self-adaptive graph convolution neural network |
CN113361352A (en) * | 2021-05-27 | 2021-09-07 | 天津大学 | Student classroom behavior analysis monitoring method and system based on behavior recognition |
CN113392743B (en) * | 2021-06-04 | 2023-04-07 | 北京格灵深瞳信息技术股份有限公司 | Abnormal action detection method, abnormal action detection device, electronic equipment and computer storage medium |
CN114613011A (en) * | 2022-03-17 | 2022-06-10 | 东华大学 | Human body 3D (three-dimensional) bone behavior identification method based on graph attention convolutional neural network |
CN114998990B (en) * | 2022-05-26 | 2023-07-25 | 深圳市科荣软件股份有限公司 | Method and device for identifying safety behaviors of personnel on construction site |
CN115050101B (en) * | 2022-07-18 | 2024-03-22 | 四川大学 | Gait recognition method based on fusion of skeleton and contour features |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110532960A (en) * | 2019-08-30 | 2019-12-03 | 西安交通大学 | A kind of action identification method of the target auxiliary based on figure neural network |
CN110705463A (en) * | 2019-09-29 | 2020-01-17 | 山东大学 | Video human behavior recognition method and system based on multi-mode double-flow 3D network |
CN111709321A (en) * | 2020-05-28 | 2020-09-25 | 西安交通大学 | Human behavior recognition method based on graph convolution neural network |
Non-Patent Citations (4)
Title |
---|
Action Recognition Based on Spatial Temporal Graph Convolutional Networks;Wanqiang Zheng等;《Proceedings of the 3rd International Conference on Computer Science and Application EngineeringOctober 2019》;1-5 * |
Skeleton-Based Action Recognition With Directed Graph Neural Networks;Lei Shi等;《Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)》;7912-7921 * |
Research on a Machine-Vision-Based Motion Posture Analysis ***;陈永康;《China Masters' Theses Full-text Database, Social Sciences II》(No. 2);H134-354 * |
A Survey of Deep-Learning-Based Behavior Recognition Algorithms;赫磊;邵展鹏;张剑华;周小龙;Computer Science (No. S1);149-157 * |
Also Published As
Publication number | Publication date |
---|---|
CN112633209A (en) | 2021-04-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112633209B (en) | Human action recognition method based on graph convolution neural network | |
CN109685115B (en) | Fine-grained conceptual model with bilinear feature fusion and learning method | |
CN108491880B (en) | Object classification and pose estimation method based on neural network | |
KR102450374B1 (en) | Method and device to train and recognize data | |
Liu et al. | Multi-objective convolutional learning for face labeling | |
CN111291809B (en) | Processing device, method and storage medium | |
Chen et al. | A UAV-based forest fire detection algorithm using convolutional neural network | |
CN110222718B (en) | Image processing method and device | |
CN105631398A (en) | Method and apparatus for recognizing object, and method and apparatus for training recognizer | |
CN108961253A (en) | A kind of image partition method and device | |
CN111582095B (en) | Light-weight rapid detection method for abnormal behaviors of pedestrians | |
CN109919085B (en) | Human-human interaction behavior identification method based on light-weight convolutional neural network | |
WO2021073311A1 (en) | Image recognition method and apparatus, computer-readable storage medium and chip | |
Xia et al. | Face occlusion detection based on multi-task convolution neural network | |
CN108154156B (en) | Image set classification method and device based on neural topic model | |
CN112801015A (en) | Multi-mode face recognition method based on attention mechanism | |
CN110765960B (en) | Pedestrian re-identification method for adaptive multi-task deep learning | |
CN110633624A (en) | Machine vision human body abnormal behavior identification method based on multi-feature fusion | |
CN111400572A (en) | Content safety monitoring system and method for realizing image feature recognition based on convolutional neural network | |
CN110598746A (en) | Adaptive scene classification method based on ODE solver | |
CN114463837A (en) | Human behavior recognition method and system based on self-adaptive space-time convolution network | |
Liu | Human face expression recognition based on deep learning-deep convolutional neural network | |
US20080232682A1 (en) | System and method for identifying patterns | |
CN114170659A (en) | Facial emotion recognition method based on attention mechanism | |
CN117611838A (en) | Multi-label image classification method based on self-adaptive hypergraph convolutional network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||