CN113378656A - Action identification method and device based on self-adaptive graph convolution neural network

Action identification method and device based on self-adaptive graph convolution neural network

Info

Publication number
CN113378656A
Authority
CN
China
Prior art keywords
data
graph convolution
neural network
action
node
Prior art date
Legal status
Granted
Application number
CN202110564099.8A
Other languages
Chinese (zh)
Other versions
CN113378656B (en)
Inventor
胡凯
丁益武
陆美霞
黄昱锟
Current Assignee
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202110564099.8A priority Critical patent/CN113378656B/en
Publication of CN113378656A publication Critical patent/CN113378656A/en
Application granted granted Critical
Publication of CN113378656B publication Critical patent/CN113378656B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a motion recognition method and device based on a self-adaptive graph convolution neural network, wherein the method comprises the following steps: S1, generating a human skeleton data set; S2, taking the angles between adjacent bone edges as deep spatial features; S3, calculating the average energy change value of each key node and taking the average energy change value as a deep temporal feature; S4, constructing a dual-flow graph convolutional neural network; S5, expanding the dual-flow graph convolutional neural network by connecting 2 newly added sub-networks in parallel to construct an action recognition model, wherein the newly added sub-networks are respectively used for processing the spatial features and the temporal features; the action recognition model simultaneously processes joint data, bone data, deep spatial features and deep temporal features, and calculates the corresponding action type. The invention can effectively improve the recognition accuracy of graph convolution networks in the field of action recognition.

Description

Action identification method and device based on self-adaptive graph convolution neural network
Technical Field
The invention relates to the technical field of action recognition in video streams, and in particular to a motion recognition method and device based on an adaptive graph convolution neural network.
Background
In the field of machine learning, motion recognition is a very important task. It can be applied in many everyday scenes such as autonomous driving, human-computer interaction and public safety, so the task has attracted more and more attention. With the explosive development of machine learning and deep learning in recent years, many motion recognition algorithms with excellent performance have emerged, and algorithms based on spatio-temporal graph convolution have achieved particularly strong results.
Existing action recognition algorithms based on graph neural networks only use very shallow features. First, the coordinates of human key points obtained by a pose estimation algorithm and the confidence of those coordinates are used directly as features, while the positional relationship between key points and bones is neglected; for example, the key point at the shoulder depends on where the upper body is located, while it in turn determines the position of the upper arm. Second, there is no clear distinction regarding the duration of an action: falling and lying down, for example, look similar, yet falling is obviously faster than lying down. These problems indicate that existing methods still do not sufficiently extract the information contained in the data.
Thus, although skeleton-based action recognition algorithms have achieved excellent results on public data sets, current algorithms use only relatively shallow features, do not consider the association between the nodes and edges of the skeleton data or the association between edges, and offer no effective solution to the problem that actions such as "fall" and "lie down" are hard to distinguish.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a motion recognition method and a device based on a self-adaptive graph convolution neural network.
In order to achieve the purpose, the invention adopts the following technical scheme:
in a first aspect, an embodiment of the present invention provides a motion recognition method based on an adaptive graph convolution neural network, where the motion recognition method includes:
s1, acquiring video stream data of the human body action type to be identified, processing the imported video stream data by adopting the existing posture estimation algorithm to obtain human body skeleton type data and human body skeleton graphics, generating and simultaneously obtaining coordinates and confidence characteristics of each key node, and generating a human body skeleton data set;
s2, calculating the change of angular momentum when the skeleton rotates around a key node in the process of human motion, and taking the angle between adjacent skeleton edges as a deep spatial feature;
s3, extracting energy information in the action duration time of the human body, accumulating the angle difference generated by the rotation of the skeleton around the key nodes to obtain the sum of angle change in the action duration time, dividing the accumulated sum of the angle difference corresponding to each key node by the number of the key frames of the current action, and calculating to obtain the average energy change value of each key node as the deep time characteristic;
s4, constructing a dual-flow graph convolutional neural network, wherein joint data and bone data are respectively used as input data of J flow and B flow, and predicted action labels are used as output data;
s5, expanding the double-flow graph convolution neural network, connecting 2 newly-added sub-networks in parallel, and constructing an action recognition model, wherein the newly-added sub-networks are respectively used for processing the spatial characteristics and the time characteristics; the action recognition model is used for simultaneously processing joint data, skeleton data, deep spatial features and deep temporal features, and calculating to obtain corresponding action types.
Optionally, in step S2, the step of calculating the change of angular momentum when the bone rotates around the key node in the human motion process, and using the variable of the angle between the adjacent bone edges as the deep spatial feature includes the following steps:
s21, calculating angles between all adjacent bones according to the coordinates and physical connection of each key node in the human skeleton data set; when the degree of the node is 1, the node only has one edge and does not calculate an angle; when the degree of the node is 2, namely one node is connected with two edges, an angle smaller than 180 degrees is calculated; when the degree of the node is 3, namely one node is connected with 3 edges, calculating 3 angles; when the degree of the node is 4, 4 angles are calculated;
s22, aiming at all angles in n key frames in the whole action duration, combining the calculated angles into a matrix form according to the sequence of key nodes and video frames, and expanding the obtained angle matrix to be:
Figure BDA0003080208800000021
wherein m is the total number of angles,
Figure BDA0003080208800000022
is the value of the ith angle in the jth key frame, i is 1,2, …, m, j is 1,2, …, n;
s23, subtracting the corresponding angle of the corresponding key point of the previous frame from the angle of any key point of the next frame to obtain the angle difference formed by the surrounding edge of the same node between the adjacent frames; calculating an angle difference matrix delta theta formed by surrounding bones of adjacent frames by taking the same node as a central point:
Figure BDA0003080208800000023
in the formula (I), the compound is shown in the specification,
Figure BDA0003080208800000024
is the value of the mth angle in the (n-1) th key frame.
Optionally, in step S3, the process of extracting the energy information within the action duration of the human body, accumulating the angle differences generated by the rotation of the bones around the key nodes to obtain the sum of the angle changes within the action duration, and dividing the accumulated sum of angle differences corresponding to each key node by the number of key frames of the current action to obtain the average energy change value of each key node as the deep temporal feature includes the following steps:
S31, accumulating and summing the calculated angle difference matrix Δθ in time order to obtain the sum of angle changes $\theta_I$ on each node, expressed as:

$$\theta_I=\left[\sum_{j=1}^{n-1}\Delta\theta_1^{\,j},\ \sum_{j=1}^{n-1}\Delta\theta_2^{\,j},\ \cdots,\ \sum_{j=1}^{n-1}\Delta\theta_{m-1}^{\,j}\right]$$

wherein the subscripts "1 to m-1" denote the numbers of the key nodes and the superscripts "1 to n-1" of $\Delta\theta_i^{\,j}$ index the key frames, forming a 1 × (m-1) energy matrix $\theta_I$;
S32, dividing the $\theta_I$ obtained in step S31 by the number of frames of the current action to obtain the average energy $\theta_a$ of the current action, wherein $\theta_a=\theta_I/n$ and n is the number of key frames extracted by the pose estimation algorithm.
Optionally, in step S4, the process of constructing the dual-flow graph convolutional neural network includes the following steps:
step 4.1: building an adaptive graph convolution layer; the adaptive graph convolution layer is used for optimizing the topology of the network together with the other parameters of the network in an end-to-end learning manner, the skeleton graph is unique to different layers and samples, and the topology of the graph is determined by an adjacency matrix $A_k$ and a mask $M_k$, $A_k$ determining whether a connection exists between two vertices and $M_k$ determining the strength of the connection, giving the following expression:

$$f_{out}=\sum_{k}^{K_v}W_k f_{in}\left(A_k+B_k+C_k\right)$$

in the formula, $K_v$ is the kernel size of the spatial dimension, set to 3; $W_k$ is a weight matrix, k ∈ [0,3]; $A_k=\Lambda_k^{-\frac{1}{2}}\bar{A}_k\Lambda_k^{-\frac{1}{2}}$ is an N × N adjacency matrix representing the physical structure of the human body, $\Lambda_k^{ii}=\sum_j\bar{A}_k^{ij}+\alpha$ being the normalized diagonal matrix; $B_k$ is an N × N adjacency matrix whose elements are trained and optimized together with the adaptive graph convolution layer, the values of $B_k$ are not limited, and the elements of the matrix are arbitrary values that indicate the presence and strength of a connection between two joints; $C_k$ is a data-dependent graph used to learn a unique graph for each sample;
to determine whether a connection exists between two vertices and the strength of the connection, a normalized embedded Gaussian function is used to calculate the similarity between the two vertices:

$$f(v_i,v_j)=\frac{e^{\theta(v_i)^{T}\phi(v_j)}}{\sum_{j=1}^{N}e^{\theta(v_i)^{T}\phi(v_j)}}$$

wherein N represents the total number of key points, and $v_i$ and $v_j$ are the feature information on the nodes;
given the input feature matrix, two embedding functions θ(·) and φ(·) change the dimension of the input from $C_{in}\times T\times N$ to $C_e\times T\times N$; the two embedded feature matrices are rearranged and reshaped into an $N\times C_eT$ matrix and a $C_eT\times N$ matrix, which are multiplied to obtain a similarity matrix, and $C_k$ is calculated using the following formula:

$$C_k=\mathrm{softmax}\left(f_{in}^{T}W_{\theta k}^{T}W_{\phi k}f_{in}\right)$$

in the formula, $W_\theta$ and $W_\phi$ are the parameters of the embedding functions θ(·) and φ(·);
step 4.2: building an adaptive graph convolution module; the adaptive graph convolution module comprises a spatial graph convolution layer convs, a temporal graph convolution layer convt, an additional random dropout operation Dropout and a residual connection which are connected in sequence; the Dropout rate is set to 0.5; a batch normalization layer and an activation function layer are connected after the spatial graph convolution layer convs and the temporal graph convolution layer convt, respectively;
step 4.3: stacking the adaptive graph convolution modules to build an adaptive graph convolution network; the adaptive graph convolution network comprises 9 adaptive graph convolution modules, whose numbers of output channels are 64, 64, 64, 64, 128, 128, 128, 256 and 256, respectively; a data BN layer is added at the beginning to normalize the input data, a global average pooling layer pools the feature maps of different samples to the same size, and the final output is sent to a SoftMax classifier to obtain the prediction;
step 4.4: building a dual-flow graph convolutional neural network;
calculating data of joints and data of bones, inputting the joint data and the bone data into J flow and B flow respectively, adding SoftMax scores of the two flows to obtain a fusion score and predicting an action label.
Optionally, in step S5, the step of calculating the corresponding action type includes:
s51, expanding the double-flow graph convolution neural network, connecting 2 newly-added sub-networks in parallel on the basis of 2 existing sub-networks of the double-flow graph convolution neural network, and building an action recognition model;
s52, respectively introducing the bone data, the joint data, the angle change between bones and the energy generated by the motion into four sub-networks of the motion recognition model to obtain corresponding prediction scores; the action recognition model also comprises an accumulator and a SoftMax classifier, and after the 4 prediction scores are added by the accumulator, the accumulated result is led into the SoftMax classifier to obtain a final classification result; the final classification result S is calculated as:
S=S1W1+S2W2+S3W3+S4W4
in the formula, S1, S2, S3 and S4 are the prediction score results of the 4 sub-networks respectively, and W1, W2, W3 and W4 are the weights of the 4 sub-networks, which are hyper-parameters.
In a second aspect, an embodiment of the present invention provides a motion recognition apparatus based on an adaptive graph-convolution neural network, where the motion recognition apparatus includes:
the human body skeleton data set generation module is used for acquiring video stream data of human body action types to be identified, processing the imported video stream data by adopting the existing posture estimation algorithm to obtain human body skeleton type data and human body skeleton graphs, generating and simultaneously obtaining coordinates and confidence coefficient characteristics of each key node, and generating a human body skeleton data set;
the spatial feature extraction module is used for calculating the change of angular momentum when bones rotate around key nodes in the human motion process, and taking the variable of the angle between adjacent bone edges as a deep spatial feature;
the time characteristic extraction module is used for extracting energy information in the action duration time of a human body, accumulating the angle difference generated by the rotation of the skeleton around the key nodes to obtain the total of angle change in the action duration time, dividing the accumulated sum of the angle difference corresponding to each key node by the number of key frames of the current action, and calculating to obtain the average energy change value of each key node as the deep time characteristic;
the double-flow graph convolution neural network construction module is used for constructing a double-flow graph convolution neural network, wherein joint data and bone data are respectively used as input data of J flow and B flow, and a predicted action tag is used as output data;
the action recognition model building module is used for expanding the double-flow graph convolution neural network, connecting 2 newly-added sub-networks in parallel and building an action recognition model, wherein the 2 newly-added sub-networks are respectively used for processing the spatial characteristics and the temporal characteristics;
and the action recognition model is used for simultaneously processing joint data, bone data, deep spatial features and deep temporal features and calculating to obtain corresponding action types.
Optionally, the dual-flow graph convolutional neural network includes 2 sub-networks; the joint data and the bone data are respectively used as input data of 2 sub-networks, and corresponding prediction scores are obtained after sub-network processing.
Optionally, each of the sub-networks or newly added sub-networks includes 9 adaptive graph convolution modules, whose numbers of output channels are 64, 64, 64, 64, 128, 128, 128, 256 and 256, respectively; a data BN layer is added at the beginning to normalize the input data, a global average pooling layer pools the feature maps of different samples to the same size, and the final output is sent to a SoftMax classifier to obtain the prediction;
the adaptive graph convolution module comprises a spatial graph convolution layer convs, a temporal graph convolution layer convt, an additional random dropout operation Dropout and a residual connection which are connected in sequence; the Dropout rate is set to 0.5; a batch normalization layer and an activation function layer are connected after the spatial graph convolution layer convs and the temporal graph convolution layer convt, respectively.
In a third aspect, the present embodiment refers to an electronic device comprising:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement an adaptive graph convolution neural network based action recognition method as previously described.
In a fourth aspect, the present embodiment refers to a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method for motion recognition based on an adaptive graph convolution neural network as described above.
The invention has the beneficial effects that:
on one hand, the invention is inspired by angular momentum in robot dynamics, and calculates the change of the angular momentum when bones rotate around key points in the process of human motion, so that the variable of the angle between adjacent bone edges is introduced as a deep spatial feature; on the other hand, energy information in the action duration of the human body is extracted, the obtained angle differences are accumulated to obtain the total sum of angle changes in the action duration, and finally the angle sum on each node is divided by the number of key frames of the current action, and the obtained result is used as deep time characteristics. By adding the spatial characteristic of angle change and the temporal characteristic of average energy change, the final classification accuracy can be greatly improved by the action recognition model, and the advantages of a skeleton data set in the action recognition field are fully utilized by time-space combination, so that the conventional double-flow self-adaptive graph convolution network is more suitable for the task of action recognition.
Drawings
Fig. 1 is a flowchart of an action recognition method based on an adaptive graph convolution neural network according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of node labels of the public data set NTU-RGB + D data set according to the embodiment of the present invention.
FIG. 3 is a diagram of an adaptive graph rolling module according to an embodiment of the invention.
Fig. 4 is a schematic diagram of an adaptive graph convolution network according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a dual-stream adaptive graph convolution network according to an embodiment of the present invention.
Fig. 6 is a schematic structural diagram of a motion recognition model according to an embodiment of the present invention.
Fig. 7 is a schematic view of an identification process of the motion identification model according to the embodiment of the present invention.
Detailed Description
The present invention will now be described in further detail with reference to the accompanying drawings.
It should be noted that terms such as "upper", "lower", "left", "right", "front" and "back" used in the present invention are for clarity of description only and are not intended to limit the implementable scope of the invention; changes or adjustments of their relative relationships, without substantial change to the technical content, shall also be regarded as falling within the scope of the invention.
Example one
Fig. 1 is a flowchart of an action recognition method based on an adaptive graph-convolution neural network according to an embodiment of the present invention. The present embodiment may be used for recognizing human body motion in a video stream through a server or other devices, and the method may be performed by an adaptive graph-convolution neural network-based motion recognition apparatus, which may be implemented in software and/or hardware, and may be integrated in an electronic device, such as an integrated server device.
Referring to fig. 1, the motion recognition method includes:
s1, obtaining video stream data of the human body action type to be identified, processing the imported video stream data by adopting the existing posture estimation algorithm to obtain human body skeleton type data and human body skeleton graphics, generating and simultaneously obtaining the coordinates and confidence coefficient characteristics of each key node, and generating a human body skeleton data set.
Specifically, the existing posture estimation algorithm is used for processing video stream data into data of a human skeleton type, so that a human skeleton graph is obtained, meanwhile, the characteristics of coordinates, confidence coefficient and the like of each key point are obtained, and the data are stored in a text file for use in the subsequent steps.
For convenience of description, the algorithm is tested by taking 10 videos as an example, actions of people in the videos cover actions in the NTU-RGB + D data set, and finally a node label of the NTU-RGB + D data set shown in fig. 2 is obtained, in fig. 2, there are 25 key nodes, and serial numbers are 1 to 25 respectively.
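As a minimal sketch of this preprocessing step, assuming some existing pose estimation algorithm is wrapped by a hypothetical callable `estimate_pose` that returns the 25 key-node coordinates and confidences for one frame (the function name and the output file name are illustrative, not taken from the patent), the skeleton data set could be collected and stored roughly as follows:

```python
import json

def build_skeleton_dataset(frames, estimate_pose, out_path="skeleton.txt"):
    """Collect (x, y, confidence) for the 25 key nodes of every key frame.

    `estimate_pose` is a hypothetical callable wrapping an existing pose
    estimation algorithm; it returns a list of 25 (x, y, confidence) tuples
    for a single video frame.
    """
    skeleton_sequence = []
    for frame in frames:
        keypoints = estimate_pose(frame)          # 25 x (x, y, confidence)
        skeleton_sequence.append(keypoints)
    # store the skeleton data in a text file for use in the subsequent steps
    with open(out_path, "w") as f:
        json.dump(skeleton_sequence, f)
    return skeleton_sequence
```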
S2, calculating the change of angular momentum when the skeleton rotates around the key node in the process of human motion, and taking the angle between the adjacent skeleton edges as the deep spatial feature.
The purpose of step S2 is to calculate the change Δθ of the angles between all key points and bones. The innovation of step 2 is that it makes full use of the positional relationship between bones during human motion: the skeleton graph extracted from human actions is similar to the joints defined in robotics, so the angular-momentum variable from robot dynamics is introduced, and the information in the data is exploited by calculating the change of angular momentum generated when the human body moves. Since the mass of a person cannot be estimated, only the angle is retained; the angle represents the relationship between key points and joints during human motion and expands the spatial information available to the algorithm. The method specifically comprises the following steps:
step 2.1: and calculating the included angle between the bones. Calculating the angle between two adjacent skeletons according to the coordinate and physical connection of each node in the human skeleton data set, wherein when the degree of the node is 1, namely the node only has one edge, the angle does not need to be calculated; when the degree of the node is 2, namely one node is connected with two edges, only an angle smaller than 180 degrees needs to be calculated; when the degree of a node is 3, namely one node is connected with 3 edges, 3 angles need to be calculated; similarly, a node with 4 degrees needs to calculate 4 angles. As shown in FIG. 2, taking node 21 as an example, 4 angles formed between 4 bones of the node are calculated
$\theta_{tl}^{21}$, $\theta_{tr}^{21}$, $\theta_{ll}^{21}$ and $\theta_{lr}^{21}$, where "tl" denotes the top-left angle, "tr" the top-right angle, "ll" the lower-left angle, "lr" the lower-right angle, and "21" indicates that the angles are centered on the 21st node; nodes of degree 2 and degree 3 are handled in the same manner. Equations (1) to (4) compute the 4 angles around node 21 as the center point; each angle is computed from the horizontal coordinate "x" and the vertical coordinate "y" of node 21 and of the two adjacent nodes that bound the angle, the subscript of a coordinate denoting the node label. All other nodes with degree greater than 2 can calculate the angles of all their adjacent bones by substituting the corresponding node coordinates into the same formulas.
Step 2.2: and combining the 26 angles obtained in the step 2.1 into a matrix form according to the sequence of the nodes. Arranging all angles in the first frame according to the node sequence, if nodes with the degree of more than 2 are arranged according to the principle of from left to right and from top to bottom, taking the first frame as an example, and obtaining an angle matrix of
$$\left(\theta_1^1,\ \theta_2^1,\ \cdots,\ \theta_{26}^1\right)$$

where the superscript "1" denotes the first frame and the subscripts "1, 2, …, 26" indicate that there are 26 angles in total in the first frame. All angles of the n frames of the entire action duration are combined in the same way and then expanded row by row into the angle matrix shown below:

$$\theta=\begin{bmatrix}\theta_1^1&\theta_2^1&\cdots&\theta_{26}^1\\\theta_1^2&\theta_2^2&\cdots&\theta_{26}^2\\\vdots&\vdots&&\vdots\\\theta_1^n&\theta_2^n&\cdots&\theta_{26}^n\end{bmatrix}$$
step 2.3: and calculating the change delta theta of the angle formed by the surrounding bones by taking the same node as a central point between adjacent frames. Subtracting all angles of the previous frame from the angles of all key points of the next frame to obtain an angle difference formed by the peripheral edge of the same node between adjacent frames, and calculating the angle difference by using the angle matrix known in the step 2.2 to obtain an angle difference matrix delta theta, wherein the matrix expression form is as follows:
$$\Delta\theta=\begin{bmatrix}\theta_1^2-\theta_1^1&\theta_2^2-\theta_2^1&\cdots&\theta_{26}^2-\theta_{26}^1\\\vdots&\vdots&&\vdots\\\theta_1^n-\theta_1^{n-1}&\theta_2^n-\theta_2^{n-1}&\cdots&\theta_{26}^n-\theta_{26}^{n-1}\end{bmatrix}$$

where $\theta_{26}^{\,n-1}$ is the value of the 26th angle in the (n-1)-th key frame.
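Steps 2.2 and 2.3 can be expressed compactly with NumPy; a sketch assuming the 26 per-frame angles of step 2.1 have already been computed:

```python
import numpy as np

def angle_difference_matrix(angles_per_frame):
    """Build the n x 26 angle matrix and the (n-1) x 26 difference matrix.

    `angles_per_frame` is a list of n key frames, each a list of the 26 angles
    arranged in the fixed node order described in step 2.2.
    """
    theta = np.asarray(angles_per_frame, dtype=np.float32)   # shape (n, 26)
    delta_theta = theta[1:] - theta[:-1]                      # shape (n-1, 26)
    return theta, delta_theta
```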
And S3, extracting energy information within the action duration of the human body, accumulating the angle differences generated by the rotation of the bones around the key nodes to obtain the sum of the angle changes within the action duration, and dividing the accumulated sum of angle differences corresponding to each key node by the number of key frames of the current action to obtain the average energy change value of each key node as the deep temporal feature.
Step S3 is inspired by the concept of integration in mathematics: the total energy $\theta_I$ generated during the duration of the action is calculated, and this result is then divided by the number of key frames extracted by the pose estimation algorithm to obtain the average energy change. The innovation of step S3 is to represent the sum of the energy generated by the whole action by the sum of the angle changes accumulated once the action is completed, and then to divide $\theta_I$ by the number of key frames to obtain the average energy change; this further exploits the temporal characteristics of the skeleton data set and is an effective solution to the problem that similar actions such as falling and lying down are hard to distinguish. The method specifically comprises the following steps:
Step 3.1: the angle difference matrix Δθ calculated in step 2.3 is accumulated and summed in time order to obtain the sum of the angle changes $\theta_I$ on each node, expressed as follows:
$$\theta_I=\left[\theta_1,\ \theta_2,\ \cdots,\ \theta_{25}\right],\qquad \theta_i=\sum_{j=1}^{n-1}\Delta\theta_i^{\,j}$$

where the subscripts "1 to 25" denote the node labels and the superscripts "1 to n-1" of $\Delta\theta_i^{\,j}$ index the key frames; $\theta_2$ to $\theta_{25}$ are accumulated in the same manner, finally forming a 1 × 25 energy matrix $\theta_I$.
Step 3.2: the $\theta_I$ obtained in step 3.1 is divided by the number of frames of the current action to obtain the average energy of the action $\theta_a$, where

$$\theta_a=\frac{\theta_I}{n}$$

n represents the number of key frames extracted by the pose estimation algorithm, and $\theta_1$ to $\theta_{25}$ are the sums on each node calculated in step 3.1.
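A corresponding sketch of steps 3.1 and 3.2, which reduce Δθ to the average-energy temporal feature:

```python
import numpy as np

def average_energy(delta_theta, n_key_frames):
    """Sum the angle changes over time and normalise by the number of key frames.

    `delta_theta` has shape (n-1, num_nodes); the result theta_a is the
    1 x num_nodes average-energy feature used as the deep temporal feature.
    """
    theta_I = delta_theta.sum(axis=0)          # total angle change per node
    theta_a = theta_I / float(n_key_frames)    # average energy per node
    return theta_a
```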
And S4, constructing a dual-flow graph convolutional neural network, wherein joint data and bone data are used as input data of the J stream and the B stream respectively, and the predicted action labels are used as output data. The method specifically comprises the following steps:
Step 4.1: an adaptive graph convolution layer (AGC) is built. The topology of the network is optimized together with the other parameters of the network in an end-to-end learning manner, and the skeleton graph is unique to different layers and samples, which greatly improves the flexibility of the model. More specifically, the topology of the graph is actually determined by an adjacency matrix and a mask, namely $A_k$ and $M_k$: $A_k$ determines whether a connection exists between two vertices, and $M_k$ determines the strength of the connection. Thus, the expression of formula (5) is obtained:

$$f_{out}=\sum_{k}^{K_v}W_k f_{in}\left(A_k+B_k+C_k\right)\qquad(5)$$

In the above formula, $K_v$ is the kernel size of the spatial dimension, set to 3, k ∈ [0,3], and $W_k$ is a weight matrix. $A_k=\Lambda_k^{-\frac{1}{2}}\bar{A}_k\Lambda_k^{-\frac{1}{2}}$ is an N × N adjacency matrix representing the physical structure of the human body, where $\Lambda_k^{ii}=\sum_j\bar{A}_k^{ij}+\alpha$ is the normalized diagonal matrix and α is set to 0.001 in order to prevent empty rows. $B_k$ also represents an N × N adjacency matrix, but unlike $A_k$, the elements of $B_k$ are trained and optimized together during the training process; $B_k$ is not restricted, which means that this graph is completely learned from the training data, and in this data-driven manner the model can learn a graph that is completely specific to the recognition task and more individualized for the different information contained in different layers. An element of the matrix can be any value: it indicates not only the existence of a connection between two joints but also the strength of the connection, playing the same role as the attention mechanism performed by $M_k$. $C_k$ is a data-dependent graph that learns a unique graph for each sample; in order to determine whether a connection exists between two vertices and the strength of that connection, the similarity between the two vertices is calculated with the normalized embedded Gaussian function shown in formula (6):

$$f(v_i,v_j)=\frac{e^{\theta(v_i)^{T}\phi(v_j)}}{\sum_{j=1}^{N}e^{\theta(v_i)^{T}\phi(v_j)}}\qquad(6)$$

where N represents the total number of key points, and $v_i$ and $v_j$ are the feature information on the nodes. More specifically, given the input feature matrix, two embedding functions θ(·) and φ(·) change the dimension of the input from $C_{in}\times T\times N$ to $C_e\times T\times N$; the two embedded feature matrices are rearranged and reshaped into an $N\times C_eT$ matrix and a $C_eT\times N$ matrix, which are multiplied to form a similarity matrix, and $C_k$ is calculated using formula (7):

$$C_k=\mathrm{softmax}\left(f_{in}^{T}W_{\theta k}^{T}W_{\phi k}f_{in}\right)\qquad(7)$$

In the above formula, $W_\theta$ and $W_\phi$ are the parameters of the embedding functions θ(·) and φ(·).
Step 4.2: the adaptive graph convolution module is built. The convolution in the time dimension is the same as in the classic spatio-temporal graph convolutional network (ST-GCN), and both the spatial graph convolution layer and the temporal graph convolution layer are followed by a batch normalization (BN) layer and an activation function (ReLU) layer. As shown in fig. 3, one basic block is a combination of a spatial graph convolution layer (convs), a temporal graph convolution layer (convt) and an additional random dropout operation (Dropout); the Dropout rate is set to 0.5, and a residual connection is added to each block for stable training.
Step 4.3: the adaptive graph convolution network is built. The adaptive graph convolution network (AGCN) is obtained by stacking the modules of step 4.2, as shown in fig. 4, with 9 modules in total whose numbers of output channels are 64, 64, 64, 64, 128, 128, 128, 256 and 256, respectively. A data BN layer is added at the beginning to normalize the input data, and a global average pooling layer pools the feature maps of different samples to the same size. The final output is sent to the SoftMax classifier to obtain the prediction.
Step 4.4: and building a double-flow network. Referring to fig. 5, the data of the joints and the data of the bones are calculated first, then the joint data and the bone data are input into the J stream and the B stream respectively, and finally the SoftMax scores of the two streams are added to obtain a fusion score and predict an action label.
S5, expanding the double-flow graph convolution neural network, connecting 2 newly-added sub-networks in parallel, and constructing an action recognition model, wherein the newly-added sub-networks are respectively used for processing the spatial characteristics and the time characteristics; the action recognition model is used for simultaneously processing joint data, skeleton data, deep spatial features and deep temporal features, and calculating to obtain corresponding action types. And modifying the structure of the network again, and expanding the feature input on the basis of keeping the feature extraction method of the original model. The model consists of 4 sub-networks, wherein 2 sub-networks are kept unchanged on the basis of the original double-flow self-adaptive graph convolution neural network, and the other 2 sub-networks are respectively used for extracting the characteristics of space and time, and the specific steps comprise:
step 5.1: and building a spatio-temporal feature expansion graph convolution neural network model. Based on the double-flow adaptive graph convolution network described in step 4, the motion recognition model of this embodiment is shown in fig. 6. The motion recognition model consists of 4 sub-networks, 2 sub-networks keep the existing double-flow self-adaptive graph convolution network unchanged, and the rest 2 sub-networks are used for extracting features in space and time. And finally, each sub-network obtains a predicted score through a SoftMax classifier, and then the 4 scores are added to obtain a final classification result. The final classification score is S, which is expressed as equation (8):
S=S1W1+S2W2+S3W3+S4W4 (8)
In the above formula, S1, S2, S3 and S4 represent the prediction score results of the 4 sub-networks, and W1, W2, W3 and W4 represent their weights, which are hyper-parameters whose magnitudes can be adjusted according to the results.
Step 5.2: the model of this patent is trained. First, the data are preprocessed and the data structures in the NTU-RGB+D public data set are recombined, and the angle difference matrix Δθ and the average energy change matrix $\theta_a$ are solved according to the formulas in step 2 and step 3; then the two spatio-temporal feature matrices Δθ and $\theta_a$ are input into the adaptive graph convolutional neural network model. The optimization strategy of the model adopts stochastic gradient descent (SGD) with a Nesterov momentum of 0.9, the batch size (Batch_Size) is set to 64, the weight decay is set to 0.0001, and the number of training epochs is set to 64; the other two sub-networks are used to process the data of the original 2s-AGCN, and finally the classification scores calculated by the 4 networks are added to obtain the final total score and thus the final classification result. The training flow chart is shown in fig. 7.
Example two
The embodiment provides an action recognition device based on an adaptive graph convolution neural network, which comprises a human skeleton data set generation module, a spatial feature extraction module, a temporal feature extraction module, a double-flow graph convolution neural network construction module, an action recognition model construction module and an action recognition model.
And the human body skeleton data set generating module is used for acquiring video stream data of the human body action types to be identified, processing the imported video stream data by adopting the existing posture estimation algorithm to obtain human body skeleton type data and human body skeleton graphs, generating and simultaneously obtaining the coordinates and confidence characteristics of each key node, and generating a human body skeleton data set.
And the spatial feature extraction module is used for calculating the change of angular momentum when the bones rotate around the key nodes in the human motion process, and taking the variable of the angle between the adjacent bone edges as the deep spatial feature.
And the time characteristic extraction module is used for extracting energy information in the action duration time of the human body, accumulating the angle difference generated by the rotation of the skeleton around the key nodes to obtain the total of the angle change in the action duration time, dividing the accumulated sum of the angle difference corresponding to each key node by the number of the key frames of the current action, and calculating to obtain the average energy change value of each key node as the deep time characteristic.
The double-flow graph convolution neural network construction module is used for constructing a double-flow graph convolution neural network, wherein joint data and bone data are respectively used as input data of J flow and B flow, and predicted action labels are used as output data.
And the action recognition model building module is used for expanding the double-flow graph convolution neural network, connecting 2 newly-added sub-networks in parallel and building an action recognition model, and the 2 newly-added sub-networks are respectively used for processing the spatial characteristics and the temporal characteristics.
And the action recognition model is used for simultaneously processing joint data, bone data, deep spatial features and deep temporal features and calculating to obtain corresponding action types.
Optionally, the dual-flow graph convolutional neural network includes 2 sub-networks; the joint data and the bone data are respectively used as input data of 2 sub-networks, and corresponding prediction scores are obtained after sub-network processing.
In some examples, each sub-network or newly added sub-network includes 9 adaptive graph convolution modules, whose numbers of output channels are 64, 64, 64, 64, 128, 128, 128, 256 and 256, respectively; a data BN layer is added at the beginning to normalize the input data, a global average pooling layer pools the feature maps of different samples to the same size, and the final output is sent to the SoftMax classifier to obtain the prediction. The adaptive graph convolution module comprises a spatial graph convolution layer convs, a temporal graph convolution layer convt, an additional random dropout operation Dropout and a residual connection which are connected in sequence; the Dropout rate is set to 0.5; a batch normalization layer and an activation function layer are connected after the spatial graph convolution layer convs and the temporal graph convolution layer convt, respectively.
The action recognition device of the second embodiment of the invention achieves the aim of recognizing human actions in a video stream. The action recognition device provided by the embodiment of the invention can execute the action recognition method based on the self-adaptive graph convolution neural network provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the executed method.
EXAMPLE III
The embodiment of the application provides an electronic device, which comprises a processor, a memory, an input device and an output device; in the electronic device, the number of the processors can be one or more; the processor, memory, input devices, and output devices in the electronic device may be connected by a bus or other means.
The memory, which is a computer-readable storage medium, may be used for storing software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the detection method in the embodiments of the present invention. The processor executes various functional applications and data processing of the electronic device by running the software programs, instructions and modules stored in the memory, that is, the method for recognizing the action based on the adaptive graph convolution neural network provided by the embodiment of the invention is realized.
The memory can mainly comprise a program storage area and a data storage area, wherein the program storage area can store an operating system and an application program required by at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory may further include memory located remotely from the processor, and these remote memories may be connected to the electronic device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device, and may include a keyboard, a mouse, and the like. The output device may include a display device such as a display screen.
Example four
The present application provides a computer-readable storage medium, on which a computer program is stored, and when the program is executed by a processor, the method for motion recognition based on an adaptive graph convolution neural network is implemented as described above.
Of course, the storage medium containing computer-executable instructions provided by the embodiments of the present invention is not limited to the method operations described above, and may also perform related operations in the action recognition method based on the adaptive graph convolution neural network provided by any embodiment of the present invention.
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above-mentioned embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may be made by those skilled in the art without departing from the principle of the invention.

Claims (10)

1. A motion recognition method based on an adaptive graph convolution neural network is characterized by comprising the following steps:
s1, acquiring video stream data of the human body action type to be identified, processing the imported video stream data by adopting the existing posture estimation algorithm to obtain human body skeleton type data and human body skeleton graphics, generating and simultaneously obtaining coordinates and confidence characteristics of each key node, and generating a human body skeleton data set;
s2, calculating the change of angular momentum when the skeleton rotates around a key node in the process of human motion, and taking the angle between adjacent skeleton edges as a deep spatial feature;
s3, extracting energy information in the action duration time of the human body, accumulating the angle difference generated by the rotation of the skeleton around the key nodes to obtain the sum of angle change in the action duration time, dividing the accumulated sum of the angle difference corresponding to each key node by the number of the key frames of the current action, and calculating to obtain the average energy change value of each key node as the deep time characteristic;
s4, constructing a dual-flow graph convolutional neural network, wherein joint data and bone data are respectively used as input data of J flow and B flow, and predicted action labels are used as output data;
s5, expanding the double-flow graph convolution neural network, connecting 2 newly-added sub-networks in parallel, and constructing an action recognition model, wherein the newly-added sub-networks are respectively used for processing the spatial characteristics and the time characteristics; the action recognition model is used for simultaneously processing joint data, skeleton data, deep spatial features and deep temporal features, and calculating to obtain corresponding action types.
2. The method for motion recognition based on adaptive graph convolution neural network of claim 1, wherein in step S2, the process of calculating the change of angular momentum when the bone rotates around the key node during the human motion process, and using the variable of the angle between the adjacent bone edges as the deep spatial feature includes the following steps:
s21, calculating angles between all adjacent bones according to the coordinates and physical connection of each key node in the human skeleton data set; when the degree of the node is 1, the node only has one edge and does not calculate an angle; when the degree of the node is 2, namely one node is connected with two edges, an angle smaller than 180 degrees is calculated; when the degree of the node is 3, namely one node is connected with 3 edges, calculating 3 angles; when the degree of the node is 4, 4 angles are calculated;
s22, aiming at all angles in n key frames in the whole action duration, combining the calculated angles into a matrix form according to the sequence of key nodes and video frames, and expanding the obtained angle matrix to be:
Figure FDA0003080208790000011
wherein m is the total number of angles,
Figure FDA0003080208790000012
is the value of the ith angle in the jth key frame, i is 1,2, …, m, j is 1,2, …, n;
s23, subtracting the corresponding angle of the corresponding key point of the previous frame from the angle of any key point of the next frame to obtain the angle difference formed by the surrounding edge of the same node between the adjacent frames; calculating an angle difference matrix delta theta formed by surrounding bones of adjacent frames by taking the same node as a central point:
Figure FDA0003080208790000013
in the formula (I), the compound is shown in the specification,
Figure FDA0003080208790000014
is the value of the mth angle in the (n-1) th key frame.
3. The method for motion recognition based on adaptive graph convolution neural network of claim 2, wherein in step S3, the step of extracting energy information within the motion duration of the human body, accumulating the angle differences generated by the rotation of the bone around the key nodes to obtain a total of angle changes within the motion duration, dividing the accumulated total of angle differences corresponding to each key node by the number of key frames of the current motion to calculate an average energy change value of each key node, and using the average energy change value as the deep temporal feature comprises the following steps:
s31, accumulating and summing the calculated angle difference matrix delta theta according to the time sequence to obtain the angle change sum theta on each nodeI,θIThe expression of (A) is as follows:
Figure FDA0003080208790000021
wherein the subscript "1 to m-1" represents the number of the key node,
Figure FDA0003080208790000022
"1-n-1" in the superscript represents key frames, constituting a 1 × (m-1) energy matrix θI
S21, theta obtained in the step S31IDividing the current action frame number to obtain the average energy theta of the current actionaWherein
Figure FDA0003080208790000023
And n is the number of key frames extracted by the attitude estimation algorithm.
4. The method for motion recognition based on an adaptive graph convolution neural network of claim 1, wherein in step S4, the process of constructing the dual-flow graph convolutional neural network includes the following steps:
step 4.1: building an adaptive graph convolution layer; the adaptive graph convolution layer is used for optimizing the topology of the network together with the other parameters of the network in an end-to-end learning manner, the skeleton graph is unique to different layers and samples, and the topology of the graph is determined by an adjacency matrix $A_k$ and a mask $M_k$, $A_k$ determining whether a connection exists between two vertices and $M_k$ determining the strength of the connection, giving the following expression:

$$f_{out}=\sum_{k}^{K_v}W_k f_{in}\left(A_k+B_k+C_k\right)$$

in the formula, $K_v$ is the kernel size of the spatial dimension, set to 3; $W_k$ is a weight matrix, k ∈ [0,3]; $A_k=\Lambda_k^{-\frac{1}{2}}\bar{A}_k\Lambda_k^{-\frac{1}{2}}$ is an N × N adjacency matrix representing the physical structure of the human body, $\Lambda_k^{ii}=\sum_j\bar{A}_k^{ij}+\alpha$ being the normalized diagonal matrix; $B_k$ is an N × N adjacency matrix whose elements are trained and optimized together with the adaptive graph convolution layer, the values of $B_k$ are not limited, and the elements of the matrix are arbitrary values that indicate the presence and strength of a connection between two joints; $C_k$ is a data-dependent graph used to learn a unique graph for each sample;
to determine whether a connection exists between two vertices and the strength of the connection, a normalized embedded Gaussian function is used to calculate the similarity between the two vertices:

$$f(v_i,v_j)=\frac{e^{\theta(v_i)^{T}\phi(v_j)}}{\sum_{j=1}^{N}e^{\theta(v_i)^{T}\phi(v_j)}}$$

wherein N represents the total number of key points, and $v_i$ and $v_j$ are the feature information on the nodes;
given the input feature matrix, two embedding functions θ(·) and φ(·) change the dimension of the input from $C_{in}\times T\times N$ to $C_e\times T\times N$; the two embedded feature matrices are rearranged and reshaped into an $N\times C_eT$ matrix and a $C_eT\times N$ matrix, which are multiplied to obtain a similarity matrix, and $C_k$ is calculated using the following formula:

$$C_k=\mathrm{softmax}\left(f_{in}^{T}W_{\theta k}^{T}W_{\phi k}f_{in}\right)$$

in the formula, $W_\theta$ and $W_\phi$ are the parameters of the embedding functions θ(·) and φ(·);
step 4.2: building an adaptive graph convolution module; the adaptive graph convolution module comprises a spatial graph convolution layer convs, a temporal graph convolution layer convt, an additional random dropout operation Dropout and a residual connection which are connected in sequence; the Dropout rate is set to 0.5; a batch normalization layer and an activation function layer are connected after the spatial graph convolution layer convs and the temporal graph convolution layer convt, respectively;
step 4.3: stacking the adaptive graph convolution modules to build an adaptive graph convolution network; the adaptive graph convolution network comprises 9 adaptive graph convolution modules, whose numbers of output channels are 64, 64, 64, 64, 128, 128, 128, 256 and 256, respectively; a data BN layer is added at the beginning to normalize the input data, a global average pooling layer pools the feature maps of different samples to the same size, and the final output is sent to a SoftMax classifier to obtain the prediction;
step 4.4: building a dual-flow graph convolutional neural network;
calculating data of joints and data of bones, inputting the joint data and the bone data into J flow and B flow respectively, adding SoftMax scores of the two flows to obtain a fusion score and predicting an action label.
5. The method for motion recognition based on adaptive graph-convolution neural network of claim 4, wherein in step S5, the step of calculating the corresponding motion type includes:
s51, expanding the double-flow graph convolution neural network, connecting 2 newly-added sub-networks in parallel on the basis of 2 existing sub-networks of the double-flow graph convolution neural network, and building an action recognition model;
s52, respectively introducing the bone data, the joint data, the angle change between bones and the energy generated by the motion into four sub-networks of the motion recognition model to obtain corresponding prediction scores; the action recognition model also comprises an accumulator and a SoftMax classifier, and after the 4 prediction scores are added by the accumulator, the accumulated result is led into the SoftMax classifier to obtain a final classification result; the final classification result S is calculated as:
S = S1W1 + S2W2 + S3W3 + S4W4
in the formula, S1, S2, S3 and S4 are the prediction score results of the 4 sub-networks, respectively; W1, W2, W3 and W4 are the weights of the 4 sub-networks and are hyper-parameters.
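A sketch of the weighted accumulation in claim 5; the weight values shown are placeholders for the hyper-parameters W1 to W4, which the claim leaves unspecified, and the stream order is assumed.

    def four_stream_fuse(scores, weights=(1.0, 1.0, 0.5, 0.5)):
        # scores: four (batch, num_classes) prediction-score tensors S1..S4 from the
        # joint, bone, bone-angle and motion-energy sub-networks (order assumed).
        # weights: hyper-parameters W1..W4 (placeholder values).
        fused = sum(w * s for w, s in zip(weights, scores))        # S = sum_i Si * Wi
        return F.softmax(fused, dim=-1).argmax(dim=-1)             # final classification result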
6. An action recognition device based on an adaptive graph convolution neural network, characterized in that the action recognition device comprises:
a human body skeleton data set generation module, used for acquiring video stream data of the human body action type to be identified, processing the imported video stream data with an existing pose estimation algorithm to obtain human body skeleton data and human body skeleton graphs, obtaining the coordinate and confidence features of each key node at the same time, and generating a human body skeleton data set;
a spatial feature extraction module, used for calculating the change of angular momentum when the bones rotate around the key nodes during human motion, and taking the change of the angle between adjacent bone edges as the deep spatial feature;
a temporal feature extraction module, used for extracting the energy information over the duration of a human action: the angle differences generated as the bones rotate around the key nodes are accumulated to obtain the total angle change over the action duration, and the accumulated angle difference of each key node is divided by the number of key frames of the current action to obtain the average energy change value of each key node as the deep temporal feature (a sketch of the spatial and temporal feature computations follows this claim);
a dual-flow graph convolution neural network construction module, used for constructing a dual-flow graph convolution neural network, wherein the joint data and the bone data are respectively used as the input data of the J flow and the B flow, and the predicted action label is used as the output data;
an action recognition model building module, used for expanding the dual-flow graph convolution neural network by connecting 2 newly added sub-networks in parallel to build the action recognition model, wherein the 2 newly added sub-networks are used for processing the spatial features and the temporal features, respectively;
and the action recognition model, used for simultaneously processing the joint data, the bone data, the deep spatial features and the deep temporal features, and calculating the corresponding action type.
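The two handcrafted features used by the spatial and temporal feature extraction modules can be sketched in NumPy as follows; the (parent, node, child) triples describing adjacent bone edges and the array shapes are assumptions of this sketch.

    import numpy as np

    def bone_angles(joints, bone_triples):
        # Angle at each key node between its two adjacent bone edges, per frame.
        # joints: (T, N, D) keypoint coordinates; bone_triples: list of (parent, node, child).
        T = joints.shape[0]
        angles = np.zeros((T, len(bone_triples)))
        for i, (p, n, c) in enumerate(bone_triples):
            u = joints[:, p] - joints[:, n]
            v = joints[:, c] - joints[:, n]
            cos = (u * v).sum(-1) / (np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1) + 1e-6)
            angles[:, i] = np.arccos(np.clip(cos, -1.0, 1.0))
        return angles

    def angle_change_and_energy(joints, bone_triples):
        # Deep spatial feature: frame-to-frame change of the inter-bone angles.
        # Deep temporal feature: accumulated angle difference of each key node
        # divided by the number of key frames (average energy change value).
        angles = bone_angles(joints, bone_triples)             # (T, K)
        d_angle = np.abs(np.diff(angles, axis=0))               # (T-1, K)
        avg_energy = d_angle.sum(axis=0) / max(angles.shape[0], 1)
        return d_angle, avg_energy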
7. The action recognition device based on the adaptive graph convolution neural network according to claim 6, wherein the dual-flow graph convolution neural network comprises 2 sub-networks; the joint data and the bone data are respectively used as the input data of the 2 sub-networks, and the corresponding prediction scores are obtained after processing by the sub-networks.
8. The action recognition device based on the adaptive graph convolution neural network according to claim 7, wherein each of the sub-networks or newly added sub-networks comprises 9 adaptive graph convolution modules whose numbers of output channels are 64, 128, 256 and 256, respectively; a data BN layer is added at the beginning to normalize the input data, a global average pooling layer pools the feature maps of different samples to the same size, and the final output is sent to a SoftMax classifier to obtain the prediction;
the adaptive graph convolution module comprises a spatial graph convolution layer convs, a temporal graph convolution layer convt, a random dropout operation (Dropout) and a residual connection which are connected in sequence; the Dropout rate is set to 0.5; a batch normalization layer and an activation function layer are connected after the spatial graph convolution layer convs and after the temporal graph convolution layer convt, respectively.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the action recognition method based on the adaptive graph convolution neural network according to any one of claims 1 to 5.
10. A computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the action recognition method based on the adaptive graph convolution neural network according to any one of claims 1 to 5.
CN202110564099.8A 2021-05-24 2021-05-24 Action recognition method and device based on self-adaptive graph convolution neural network Active CN113378656B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110564099.8A CN113378656B (en) 2021-05-24 2021-05-24 Action recognition method and device based on self-adaptive graph convolution neural network

Publications (2)

Publication Number Publication Date
CN113378656A true CN113378656A (en) 2021-09-10
CN113378656B CN113378656B (en) 2023-07-25

Family

ID=77571555

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110564099.8A Active CN113378656B (en) 2021-05-24 2021-05-24 Action recognition method and device based on self-adaptive graph convolution neural network

Country Status (1)

Country Link
CN (1) CN113378656B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401106A (en) * 2019-01-02 2020-07-10 ***通信有限公司研究院 Behavior identification method, device and equipment
US20210000404A1 (en) * 2019-07-05 2021-01-07 The Penn State Research Foundation Systems and methods for automated recognition of bodily expression of emotion
CN111476181A (en) * 2020-04-13 2020-07-31 河北工业大学 Human skeleton action recognition method
CN111950485A (en) * 2020-08-18 2020-11-17 中科人工智能创新技术研究院(青岛)有限公司 Human body behavior identification method and system based on human body skeleton
CN112528811A (en) * 2020-12-02 2021-03-19 建信金融科技有限责任公司 Behavior recognition method and device
CN112633209A (en) * 2020-12-29 2021-04-09 东北大学 Human action recognition method based on graph convolution neural network
CN112749671A (en) * 2021-01-19 2021-05-04 澜途集思生态科技集团有限公司 Human behavior recognition method based on video

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LEI SHI et al.: "Skeleton-based action recognition with multi-stream adaptive graph convolutional networks", IEEE TRANSACTIONS ON IMAGE PROCESSING, vol. 29, pages 9532-9545, XP011815656, DOI: 10.1109/TIP.2020.3028207 *
NING SUN et al.: "Multi-stream slowFast graph convolutional networks for skeleton-based action recognition", IMAGE AND VISION COMPUTING, pages 1-9 *
YI-FAN SONG et al.: "Stronger, Faster and More Explainable: A Graph Convolutional Baseline for Skeleton-based Action Recognition", PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, pages 1625-1633 *
LI LONG: "Research on human skeleton keypoint action recognition methods incorporating an attention mechanism", China Masters' Theses Full-text Database (Information Science and Technology), no. 2020, pages 138-1137 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114511553A (en) * 2022-02-25 2022-05-17 中山大学孙逸仙纪念医院 Arthritis scoring method and device based on hand X-ray image
CN114618147A (en) * 2022-03-08 2022-06-14 电子科技大学 Taijiquan rehabilitation training action recognition method
CN114618147B (en) * 2022-03-08 2022-11-15 电子科技大学 Taijiquan rehabilitation training action recognition method
CN114821640A (en) * 2022-04-12 2022-07-29 杭州电子科技大学 Skeleton action identification method based on multi-stream multi-scale expansion space-time diagram convolution network
CN114821640B (en) * 2022-04-12 2023-07-18 杭州电子科技大学 Skeleton action recognition method based on multi-flow multi-scale expansion space-time diagram convolutional network

Also Published As

Publication number Publication date
CN113378656B (en) 2023-07-25

Similar Documents

Publication Publication Date Title
CN109902546B (en) Face recognition method, face recognition device and computer readable medium
CN113378656B (en) Action recognition method and device based on self-adaptive graph convolution neural network
WO2021258967A1 (en) Neural network training method and device, and data acquisition method and device
CN111291739B (en) Face detection and image detection neural network training method, device and equipment
EP3757905A1 (en) Deep neural network training method and apparatus
CN113408455B (en) Action identification method, system and storage medium based on multi-stream information enhanced graph convolution network
CN106897714A (en) A kind of video actions detection method based on convolutional neural networks
CN107045631A (en) Facial feature points detection method, device and equipment
CN111191630B (en) Performance action recognition method suitable for intelligent interactive viewing scene
WO2020107847A1 (en) Bone point-based fall detection method and fall detection device therefor
CN113254927B (en) Model processing method and device based on network defense and storage medium
CN112464930A (en) Target detection network construction method, target detection method, device and storage medium
WO2022052782A1 (en) Image processing method and related device
CN112699837A (en) Gesture recognition method and device based on deep learning
CN111738074B (en) Pedestrian attribute identification method, system and device based on weak supervision learning
CN114821804A (en) Attention mechanism-based action recognition method for graph convolution neural network
CN113516227A (en) Neural network training method and device based on federal learning
CN113239884A (en) Method for recognizing human body behaviors in elevator car
CN113688765A (en) Attention mechanism-based action recognition method for adaptive graph convolution network
CN114743273A (en) Human skeleton behavior identification method and system based on multi-scale residual error map convolutional network
CN115482523A (en) Small object target detection method and system of lightweight multi-scale attention mechanism
CN114140841A (en) Point cloud data processing method, neural network training method and related equipment
CN113158791A (en) Human-centered image description labeling method, system, terminal and medium
CN117115911A (en) Hypergraph learning action recognition system based on attention mechanism
CN115346264A (en) Human behavior recognition algorithm based on multi-stream graph convolution residual error network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant