CN111353447B - Human skeleton behavior recognition method based on graph convolution network - Google Patents

Human skeleton behavior recognition method based on graph convolution network

Info

Publication number: CN111353447B (granted; earlier published as CN111353447A)
Application number: CN202010146319.0A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 曹江涛, 赵挺, 洪恺临
Current assignee (original assignee): Liaoning Shihua University
Legal status: Active

Classifications

    • G06V40/103 — Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G06F18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 — Classification based on parametric or probabilistic models, e.g. based on likelihood ratio
    • G06N3/044 — Recurrent networks, e.g. Hopfield networks
    • G06N3/045 — Combinations of networks
    • G06N3/047 — Probabilistic or stochastic networks
    • G06N3/08 — Learning methods
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

A human skeleton behavior recognition method based on a graph convolution network belongs to the fields of computer vision and deep learning, and comprises the steps of: obtaining human skeleton video frames and carrying out normalization processing; constructing an intrinsic-dependency connection graph of the human joints corresponding to each frame, and constructing an extrinsic-dependency connection graph for each individual and an interactive-dependency connection graph between the two individuals; obtaining the joint connection graph of the interaction as a whole; assigning weight values to each edge of each connection graph of the human joints; performing graph convolution operations to obtain the spatial features of the skeleton sequence; and performing time-series modeling with a long short-term memory network to obtain the corresponding category of the interaction behavior. According to the invention, the intrinsic-dependency connection edges can learn basic human behavior features, the extrinsic-dependency connection edges can learn additional behavior features, and the interactive-dependency connection edges can better learn the interaction relationship between the two persons, so that the motion relationship of the two-person interaction behavior is better represented and the recognition performance is improved.

Description

Human skeleton behavior recognition method based on graph convolution network
Technical Field
The invention belongs to the technical field of computer vision and deep learning, and particularly relates to a human skeleton behavior recognition method based on a graph convolutional network.
Background
Video-based human behavior recognition and understanding is a frontier direction of intense interest in the fields of image processing and computer vision. With the fusion and development of deep learning and computer vision techniques, behavior recognition has been widely applied in video analysis, intelligent surveillance, human-computer interaction, augmented reality, video retrieval and other fields. Two-person (double) interaction behavior is more common in daily life, and more difficult to recognize, than single-person actions. Research on double interaction is mainly divided into work based on RGB data and work based on skeleton joint-point data. Traditional RGB video has poor robustness due to factors such as illumination change, occlusion and complex backgrounds. Skeleton-based joint-point data contains compact three-dimensional positions of the main body joints and is robust to changes in viewpoint, body scale and movement speed. Therefore, behavior recognition based on skeleton joint-point data has received increasing attention in recent years.
Double interaction behavior recognition methods based on skeleton joint points mainly fall into two categories: methods based on hand-crafted features and methods based on deep learning. For the first category, Vemulapalli et al. [1] represent the human skeleton as points in a Lie group and carry out temporal modeling and classification in the Lie algebra. Weng et al. [2] extend the Naive Bayes Nearest Neighbor (NBNN) method to space-time and use stage-to-class distances to classify behaviors. The feature design of such methods is complex and costly, and their recognition accuracy is difficult to improve further. Deep-learning-based methods can be further divided into CNN-based models and RNN-based models. CNN-based methods convert the joint-point data into pictures and then feed them into a network for learning and classification; such methods ignore the timing information in the video. RNN-based methods can effectively model time-series information, but ignore the dependencies between joints and the interaction relationship between the two persons. (See [1] Raviteja Vemulapalli, Felipe Arrate, and Rama Chellappa. Human action recognition by representing 3D skeletons as points in a Lie group. In CVPR, pages 588–595, 2014. [2] Junwu Weng, Chaoqun Weng, and Junsong Yuan. Spatio-temporal Naive-Bayes nearest-neighbor for skeleton-based action recognition. In CVPR, pages 4171–4180, 2017.)
Recently, with the popularity of graph convolutional networks (GCN, Graph Convolutional Network), many researchers have also applied the GCN method to experiments in the field of behavior recognition. However, current research is mainly aimed at single-person behavior, mostly adopts the natural connection graph of the human body, and ignores the dependency relationships between non-naturally-connected joints of the human body. In existing applications to double interaction, the two persons are split into two individuals and modeled separately, ignoring the interactive dependency relationship between them.
Disclosure of Invention
Aiming at the problems and shortcomings of the prior art, the invention provides a double interaction behavior recognition method based on a graph convolutional network, which comprises the steps of: obtaining a double interaction skeleton video; normalizing the joint-point coordinates of the acquired video; constructing an intra-human-joint dependency graph, an individual external dependency graph and an interactive dependency graph; assigning different weights to the connection edges of the three joint connection graphs; feeding the result into a graph convolution network to learn and extract spatial features; feeding the spatial features obtained for each frame into a long short-term memory network for time-series modeling; and obtaining the recognition result of the interactive behavior category.
The method specifically comprises the following steps:
step S10, shooting video: starting a camera, recording double interaction videos, collecting skeleton videos of various interactive actions performed by different actors as training videos of the interactive actions, labeling the interactive action meaning of each training video, and establishing a video training set.
Step S20, carrying out normalization processing on a preset video frame in the acquired skeleton video to serve as a skeleton sequence to be identified.
Step S30, for each frame in the skeleton sequence to be identified, constructing a corresponding internal-dependency connection graph of the human joints according to the joint-point coordinates, wherein the joint points are the nodes of the graph and the natural connections between the joint points are the internal-dependency connection edges of the graph; constructing external-dependency connection edges for each single person and interactive-dependency connection edges between the two persons; together these form the human body joint connection graph of each frame of the skeleton sequence to be identified;
step S40, respectively distributing weights to edges of three joint connection graphs corresponding to each frame graph of the skeleton sequence to be identified, and obtaining corresponding human joint connection graphs with different weight values;
step S50, performing graph convolution operation on the human body joint connection graphs with different weight values corresponding to each frame of the skeleton sequence to be identified, and obtaining the spatial characteristics of the skeleton sequence to be identified;
and step S60, performing time sequence modeling on the time dimension based on the spatial characteristics of the skeleton sequence to be identified, and obtaining the behavior category of the skeleton sequence to be identified.
Further, "a frame of a preset video in the acquired skeleton video is normalized and then used as a skeleton sequence to be identified", the method is as follows:
step S11, performing preset equidistant sampling on the obtained original skeleton video to serve as a training and recognition skeleton sequence;
step S12, carrying out rotation, translation and scale normalization processing on the joint point coordinates of each frame in the obtained skeleton sequence to obtain the skeleton sequence to be identified, wherein the specific method comprises the following steps:
$$\hat{x}_i^t = R^{-1}\left(x_i^t - o_R\right), \qquad i \in J,\; t \in T$$

where $x_i^t$ is the $i$-th joint coordinate of the originally acquired $t$-th frame, $J$ and $T$ denote the sets of joint points and acquired frames, and $\hat{x}_i^t$ is the processed coordinate value.

The rotation matrix $R$ and rotation origin $o_R$ are defined as follows:

$$R = \left[\frac{v_1}{\|v_1\|},\; \frac{v_2 - \mathrm{proj}_{v_1}(v_2)}{\left\|v_2 - \mathrm{proj}_{v_1}(v_2)\right\|},\; \frac{v_1 \times v_2}{\|v_1 \times v_2\|}\right]$$

$$o_R = \frac{x_{\mathrm{lhip}} + x_{\mathrm{rhip}}}{2}$$

where $v_1$ is the vector perpendicular to the ground, $v_2$ is the difference vector between the left and right hip joints of the initial skeleton in each sequence, $\mathrm{proj}_{v_1}(v_2)$ and $v_1 \times v_2$ respectively denote the projection of $v_2$ onto $v_1$ and the cross product of the two vectors, and $x_{\mathrm{lhip}}$ and $x_{\mathrm{rhip}}$ denote the coordinates of the left and right hip joints of the initial skeleton of each sequence.
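For illustration, the rotation-translation normalization of step S12 can be sketched in numpy as follows; the choice of ground-normal vector, the joint layout and the function name are assumptions for this sketch, not fixed by the patent:

```python
import numpy as np

def normalize_skeleton(x, left_hip, right_hip, up=np.array([0.0, 1.0, 0.0])):
    """Rotate and translate a skeleton sequence into a body-centred frame (a sketch).

    x        : (T, J, 3) array of joint coordinates for T frames and J joints.
    left_hip, right_hip : (3,) coordinates of the hips in the initial skeleton.
    up       : assumed ground-normal direction v1 (illustrative).
    """
    v1 = up / np.linalg.norm(up)               # vector perpendicular to the ground
    v2 = right_hip - left_hip                  # difference vector between the two hips
    v2_orth = v2 - np.dot(v2, v1) * v1         # v2 minus its projection onto v1
    v2_orth /= np.linalg.norm(v2_orth)
    v3 = np.cross(v1, v2_orth)                 # cross product completes the basis
    R = np.stack([v1, v2_orth, v3], axis=1)    # orthonormal rotation matrix (columns)
    o_R = (left_hip + right_hip) / 2.0         # rotation origin: hip midpoint
    # R is orthonormal, so R^{-1} = R^T; right-multiplying rows by R applies R^T.
    return (x - o_R) @ R
```

Because the rotation is orthonormal, distances between joints are preserved while the hip midpoint of the initial skeleton is mapped to the origin.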
Further, "for each frame in the skeleton sequence to be identified, a corresponding internal-dependency connection graph of the human joints is constructed according to the joint-point coordinates, the joint points being the nodes of the graph and the natural connections between the joint points being the internal-dependency connection edges of the graph; single-person external-dependency connection edges and double-person interactive-dependency connection edges are constructed, which together form the human body joint connection graph of each frame of the skeleton sequence to be identified"; the method is as follows:
Human body modeling is carried out by regarding the double interaction in each frame as one whole-structure graph $G(x, W)$, where $x \in \mathbb{R}^{2N \times 3}$ contains the three-dimensional coordinates of the $2N$ joints and $W$ is a $2N \times 2N$ weighted adjacency matrix:

$$W = \begin{bmatrix} W_1 & w_{1,2} \\ w_{1,2}^{\top} & W_2 \end{bmatrix}, \qquad (W_k)_{mn} = \begin{cases} \alpha, & \text{joints } m,n \text{ of person } k \text{ are naturally connected} \\ \beta, & \text{joints } m,n \text{ of person } k \text{ form an external-dependency edge} \end{cases}$$

$$(w_{1,2})_{mn} = \gamma, \qquad \text{node } m \text{ of the first person and node } n \text{ of the second person form an interactive-dependency edge}$$

where $\alpha$, $\beta$, $\gamma$ respectively denote the weights corresponding to the intrinsic dependency, the extrinsic dependency and the interactive dependency relationships.
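A minimal sketch of assembling the $2N \times 2N$ weighted adjacency matrix from the three edge types is given below; the concrete edge lists are illustrative placeholders, since the patent does not enumerate the external-dependency and interactive-dependency edges at this point:

```python
import numpy as np

def build_weighted_adjacency(n_joints, intra_edges, extra_edges, inter_edges,
                             alpha=3.0, beta=1.0, gamma=5.0):
    """Build the 2N x 2N weighted adjacency matrix W for the two-person graph.

    intra_edges : natural (intrinsic-dependency) bone connections, per person.
    extra_edges : additional external-dependency connections, per person.
    inter_edges : (m, n) pairs linking joint m of person 1 to joint n of person 2.
    The edge sets passed in are illustrative, not the patent's exact choices.
    """
    W = np.zeros((2 * n_joints, 2 * n_joints))
    for person in (0, 1):                     # same structure for both individuals
        off = person * n_joints
        for m, n in intra_edges:
            W[off + m, off + n] = W[off + n, off + m] = alpha
        for m, n in extra_edges:
            W[off + m, off + n] = W[off + n, off + m] = beta
    for m, n in inter_edges:                  # interactive-dependency block w_{1,2}
        W[m, n_joints + n] = W[n_joints + n, m] = gamma
    return W
```

The resulting matrix is symmetric, with two identical diagonal blocks for the individuals and off-diagonal blocks carrying the interactive-dependency weights.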
Further, "weights are respectively allocated to edges of three joint connection graphs corresponding to each frame graph of the skeleton sequence to be identified, so as to obtain corresponding human joint connection graphs with different weight values", and the method comprises the following steps:
α = 3, β = 1 and γ = 5, so as to emphasize the internal connection relationships over the additional external connection relationships, while most strongly highlighting the interactive connection relationships.
Further, "carrying out graph convolution operation on the human body joint connection graph with different weight values corresponding to each frame graph of the skeleton sequence to be identified, and obtaining the spatial characteristics of the skeleton sequence to be identified", wherein the method comprises the following steps:
$$f_{\mathrm{out}} = g_\theta *_G f_{\mathrm{in}}$$

where $*_G$ denotes the graph convolution operation and $g_\theta$ denotes the graph convolution kernel; $W$ is the weighted adjacency matrix of the human body joint connection graph.

The graph convolution kernel is calculated as follows. The graph Laplacian, normalized over the spectral domain, is $L = I_n - D^{-1/2} W D^{-1/2}$, where $D$ is the degree matrix with $D_{ii} = \sum_j w_{ij}$. $L$ is scaled to $\tilde{L} = \frac{2}{\lambda_{\max}} L - I_n$, where $\lambda_{\max}$ is the largest eigenvalue of $L$ and $T_k$ denotes the Chebyshev polynomials. The convolution operation can then be expressed as:

$$g_\theta *_G x = \sum_{k=0}^{K-1} \eta_k\, T_k(\tilde{L})\, x$$

where $\eta \in [\eta_0, \eta_1, \ldots, \eta_{K-1}]$ are trainable parameters and $K$ is the size of the graph convolution kernel.
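The Chebyshev-polynomial spectral graph convolution described above can be sketched in numpy as follows; this is a small single-call sketch under the stated definitions, not the patent's trained layer:

```python
import numpy as np

def chebyshev_graph_conv(W, x, eta):
    """One Chebyshev graph convolution g_eta *_G x (a numpy sketch).

    W   : (n, n) weighted adjacency matrix of the joint graph.
    x   : (n, c) node feature matrix (e.g. 3D joint coordinates).
    eta : list of K coefficients eta_0 .. eta_{K-1} (trainable in practice).
    """
    n = W.shape[0]
    d = W.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(np.maximum(d, 1e-12)), 0.0)
    # Normalized graph Laplacian L = I - D^{-1/2} W D^{-1/2}
    L = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    lam_max = np.linalg.eigvalsh(L).max()
    L_tilde = 2.0 * L / lam_max - np.eye(n)          # rescale spectrum to [-1, 1]
    # Chebyshev recurrence: T_0 x = x, T_1 x = L~ x, T_k x = 2 L~ T_{k-1} x - T_{k-2} x
    Tk_prev, Tk = x, L_tilde @ x
    out = eta[0] * Tk_prev
    for k in range(1, len(eta)):
        out = out + eta[k] * Tk
        Tk_prev, Tk = Tk, 2.0 * L_tilde @ Tk - Tk_prev
    return out
```

With K = 1 the operation reduces to a scaled identity; larger K aggregates features from K-hop neighborhoods of the weighted joint graph.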
Further, "based on the spatial features of the skeleton sequence to be identified, time-series modeling is performed in the time dimension to obtain the behavior category of the skeleton sequence to be identified"; the method is as follows:
The spatial feature information of each frame obtained by the graph convolution operation is flattened by a fully connected layer, fed into a long short-term memory network for time-series modeling, and classified with softmax to obtain the final interactive behavior classification result.
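The LSTM-plus-softmax temporal classifier can be sketched as a single numpy forward pass; the gate ordering, weight shapes and parameter names below are illustrative assumptions rather than the patent's exact network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_softmax_classify(feats, params):
    """Run per-frame spatial features through one LSTM layer, then softmax.

    feats  : (T, d) sequence of flattened per-frame spatial features f_t.
    params : dict with LSTM weights Wx (4h, d), Wh (4h, h), b (4h,) and a
             classifier Wc (n_classes, h), bc (n_classes,).  Shapes and the
             gate ordering (i, f, g, o) are assumptions of this sketch.
    """
    Wx, Wh, b = params["Wx"], params["Wh"], params["b"]
    h_dim = Wh.shape[1]
    h = np.zeros(h_dim)
    c = np.zeros(h_dim)
    for f_t in feats:                           # time-series modeling frame by frame
        z = Wx @ f_t + Wh @ h + b
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)              # cell state update
        h = o * np.tanh(c)                      # hidden state
    logits = params["Wc"] @ h + params["bc"]
    p = np.exp(logits - logits.max())
    return p / p.sum()                          # softmax class probabilities
```

The predicted interactive behavior category is then the argmax of the returned probability vector.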
The advantages and effects of the invention are as follows:
The double interaction behavior recognition method based on the graph convolutional network constructs a weighted joint connection graph to which the double interaction dependency relationship is added, adopts the graph convolution network to obtain discriminative double interaction spatial features, and then feeds them into a long short-term memory network to model the dynamic temporal relationship, thereby improving recognition accuracy.
Drawings
FIG. 1 is a flow chart of the double interaction behavior recognition method based on a graph convolutional network;
FIG. 2 is a schematic illustration of the intrinsic-dependency, extrinsic-dependency and interactive-dependency joint graphs constructed by the present invention;
FIG. 3 is a flowchart of the algorithm of the present invention;
FIG. 4 is a diagram of an LSTM module cell;
FIG. 5 is the confusion matrix of the invention's test results on the NTU RGB+D dataset.
Detailed Description
The present application is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
The invention discloses a double interaction behavior recognition method based on a graph convolutional network, which comprises the following steps:
step S10, shooting video: starting a camera, recording double interaction videos, collecting skeleton videos of various interactive actions performed by different actors as training videos of the interactive actions, labeling the interactive action meaning of each training video, and establishing a video training set.
Step S20, carrying out normalization processing on a preset video frame in the acquired skeleton video to serve as a skeleton sequence to be identified.
Step S30, for each frame in the skeleton sequence to be identified, constructing a corresponding internal-dependency connection graph of the human joints according to the joint-point coordinates, wherein the joint points are the nodes of the graph and the natural connections between the joint points are the internal-dependency connection edges of the graph; constructing external-dependency connection edges for each single person and interactive-dependency connection edges between the two persons; together these form the human body joint connection graph of each frame of the skeleton sequence to be identified;
step S40, respectively distributing weights to edges of three joint connection graphs corresponding to each frame graph of the skeleton sequence to be identified, and obtaining corresponding human joint connection graphs with different weight values;
step S50, performing graph convolution operation on the human body joint connection graphs with different weight values corresponding to each frame of the skeleton sequence to be identified, and obtaining the spatial characteristics of the skeleton sequence to be identified;
and step S60, performing time sequence modeling on the time dimension based on the spatial characteristics of the skeleton sequence to be identified, and obtaining the behavior category of the skeleton sequence to be identified.
In order to more clearly describe the double interaction behavior recognition method based on a graph convolutional network of the present invention, each step of the method embodiment of the present invention is described in detail below with reference to fig. 1.
Step S10, shooting video: starting a camera, recording double interaction videos, collecting skeleton videos of various interactive actions performed by different actors as training videos of the interactive actions, labeling the interactive action meaning of each training video, and establishing a video training set.
With the development of image processing technology, a Microsoft Kinect camera can be directly adopted to obtain the skeleton video of the two interacting persons, and the corresponding joint-point data is stored.
Step S20, carrying out normalization processing on a preset video frame in the acquired skeleton video to serve as a skeleton sequence to be identified.
Due to subject variation and viewing-angle variation during shooting, normalization processing is carried out in the data processing stage; the specific method is as follows:
$$\hat{x}_i^t = R^{-1}\left(x_i^t - o_R\right), \qquad i \in J,\; t \in T$$

where $x_i^t$ is the $i$-th joint coordinate of the originally acquired $t$-th frame, $J$ and $T$ denote the sets of joint points and acquired frames, and $\hat{x}_i^t$ is the processed coordinate value.

The rotation matrix $R$ and rotation origin $o_R$ are defined as follows:

$$R = \left[\frac{v_1}{\|v_1\|},\; \frac{v_2 - \mathrm{proj}_{v_1}(v_2)}{\left\|v_2 - \mathrm{proj}_{v_1}(v_2)\right\|},\; \frac{v_1 \times v_2}{\|v_1 \times v_2\|}\right]$$

$$o_R = \frac{x_{\mathrm{lhip}} + x_{\mathrm{rhip}}}{2}$$

where $v_1$ is the vector perpendicular to the ground, $v_2$ is the difference vector between the left and right hip joints of the initial skeleton in each sequence, $\mathrm{proj}_{v_1}(v_2)$ and $v_1 \times v_2$ respectively denote the projection of $v_2$ onto $v_1$ and the cross product of the two vectors, and $x_{\mathrm{lhip}}$ and $x_{\mathrm{rhip}}$ denote the coordinates of the left and right hip joints of the initial skeleton of each sequence.
Step S30, for each frame in the skeleton sequence to be identified, constructing a corresponding internal-dependency connection graph of the human joints according to the joint-point coordinates, wherein the joint points are the nodes of the graph and the natural connections between the joint points are the internal-dependency connection edges of the graph; constructing external-dependency connection edges for each single person and interactive-dependency connection edges between the two persons, the three parts together forming the human body joint connection graph of each frame of the skeleton sequence to be identified; the method is as follows:
Human body modeling is carried out by regarding the double interaction in each frame as one whole-structure graph $G(x, W)$, where $x \in \mathbb{R}^{2N \times 3}$ contains the three-dimensional coordinates of the $2N$ joints and $W$ is a $2N \times 2N$ weighted adjacency matrix:

$$W = \begin{bmatrix} W_1 & w_{1,2} \\ w_{1,2}^{\top} & W_2 \end{bmatrix}, \qquad (W_k)_{mn} = \begin{cases} \alpha, & \text{joints } m,n \text{ of person } k \text{ are naturally connected} \\ \beta, & \text{joints } m,n \text{ of person } k \text{ form an external-dependency edge} \end{cases}$$

$$(w_{1,2})_{mn} = \gamma, \qquad \text{node } m \text{ of the first person and node } n \text{ of the second person form an interactive-dependency edge}$$

where $\alpha$, $\beta$, $\gamma$ respectively denote the weights corresponding to the intrinsic dependency, the extrinsic dependency and the interactive dependency relationships.
Step S40, respectively distributing weights to edges of three joint connection graphs corresponding to each frame graph of the skeleton sequence to be identified, and obtaining corresponding human joint connection graphs with different weight values:
Weight assignment: α = 3, β = 1 and γ = 5, so as to emphasize the internal connection relationships over the additional external connection relationships, while most strongly highlighting the interactive connection relationships.
Step S50, performing graph convolution operation on the human body joint connection graph with different weight values corresponding to each frame graph of the skeleton sequence to be identified, and obtaining the spatial characteristics of the skeleton sequence to be identified:
given a T-frame video, a graph G is constructed according to the method of claim 3 1 ,G 2 ,...,G T ]Graph G constructed for each t-frame T It is input into the picture scroll layer:
Figure SMS_26
wherein represents a graph convolution operation;
Figure SMS_27
representing the graph convolution kernel. W is a weighted adjacency matrix of the human body joint connection diagram.
The concrete graph convolution kernel is calculated as follows:
the graph laplace normalizes over the spectral domain: l=i n -D -1/2 WD -1/2 Wherein D is the angular matrix, D ii =∑ j w ij Scaling L to
Figure SMS_28
Representation->
Figure SMS_29
Wherein lambda is max Is the maximum characteristic value of L, T k Is chebyshev polynomials. The convolution operation can be expressed as:
Figure SMS_30
here eta e eta 01 ...,η K-1 ]Is a training parameter and K is the size of the graph convolution kernel.
Step S60, based on the spatial features of the skeleton sequence to be identified, time-series modeling is performed in the time dimension to obtain the behavior category of the skeleton sequence to be identified:
The spatial feature information $f_t$ of each frame obtained by the graph convolution operation is flattened by the fully connected layer, fed into a long short-term memory network for time-series modeling, and classified by softmax to obtain the final interactive behavior recognition result.
The dataset used to validate the algorithm is presented next. The NTU RGB+D dataset is currently the largest skeleton-based behavior recognition dataset, with more than 56,000 sequences and 4 million frames covering 60 action classes, each skeleton having 25 joint points; it includes both single-person and two-person actions. In this embodiment, the 11 kinds of double interaction behaviors in NTU RGB+D are adopted as the dataset.
The dataset has two evaluation protocols: cross-subject (CS) and cross-view (CV). The proposed method is evaluated here using the CV protocol.
Under the CV evaluation criterion, the data captured by cameras 2 and 3 are used for training and the data captured by camera 1 is used for testing. The final recognition rate is 88%, an evident recognition effect. The confusion matrix is shown in fig. 5.
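The recognition rate and confusion matrix reported for the evaluation can be computed with a small helper; this is a generic sketch of the metric, not code from the patent:

```python
import numpy as np

def confusion_and_accuracy(y_true, y_pred, n_classes):
    """Accumulate a confusion matrix and the overall recognition rate.

    y_true, y_pred : iterables of integer class labels.
    Returns (cm, acc): cm[t, p] counts samples of true class t predicted as p,
    and acc is the fraction of correct predictions (the recognition rate).
    """
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    acc = np.trace(cm) / cm.sum()
    return cm, acc
```

Rows of the matrix can be normalized by their sums to obtain the per-class percentages typically displayed in a confusion-matrix figure.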
Thus far, the technical solution of the present invention has been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of protection of the present invention is not limited to these specific embodiments. Equivalent modifications and substitutions for related technical features may be made by those skilled in the art without departing from the principles of the present invention, and such modifications and substitutions will fall within the scope of the present invention.

Claims (4)

1. A human skeleton behavior recognition method based on a graph convolutional network, characterized in that the recognition method comprises: obtaining a double interaction skeleton video; normalizing the joint-point coordinates of the acquired video; constructing an intra-human-joint dependency graph, an individual external dependency graph and an interactive dependency graph; assigning different weights to the connection edges of the three joint connection graphs; feeding the result into a graph convolution network to learn and extract spatial features; feeding the spatial features obtained for each frame into a long short-term memory network for time-series modeling; and obtaining the recognition result of the interactive behavior category;
the identification method specifically comprises the following steps:
step S10, shooting video: starting a camera, recording double interaction videos, collecting skeleton videos of various interactive actions performed by different actors as interactive action training videos, labeling the interactive action meaning of each training video, and establishing a video training set;
step S20, carrying out normalization processing on a preset video frame in the acquired skeleton video to serve as a skeleton sequence to be identified;
step S30, for each frame in the skeleton sequence to be identified, constructing a corresponding internal-dependency connection graph of the human joints according to the joint-point coordinates, wherein the joint points are the nodes of the graph and the natural connections between the joint points are the internal-dependency connection edges of the graph; constructing external-dependency connection edges for each single person and interactive-dependency connection edges between the two persons; together these form the human body joint connection graph of each frame of the skeleton sequence to be identified;
step S40, respectively distributing weights to edges of three joint connection graphs corresponding to each frame graph of the skeleton sequence to be identified, and obtaining corresponding human joint connection graphs with different weight values;
step S50, performing graph convolution operation on the human body joint connection graphs with different weight values corresponding to each frame of the skeleton sequence to be identified, and obtaining the spatial characteristics of the skeleton sequence to be identified;
step S60, performing time sequence modeling on the time dimension based on the spatial characteristics of the skeleton sequence to be identified to obtain the behavior category of the skeleton sequence to be identified;
in the step S30, "for each frame in the skeleton sequence to be identified, a corresponding internal-dependency connection graph of the human joints is constructed according to the joint-point coordinates, the joint points being the nodes of the graph and the natural connections between the joint points being the internal-dependency connection edges of the graph; single-person external-dependency connection edges and double-person interactive-dependency connection edges are constructed, which together form the human body joint connection graph of each frame of the skeleton sequence to be identified"; the method is as follows:
Human body modeling is carried out by regarding the double interaction in each frame as one whole-structure graph $G(x, W)$, where $x \in \mathbb{R}^{2N \times 3}$ contains the three-dimensional coordinates of the $2N$ joints and $W$ is a $2N \times 2N$ weighted adjacency matrix:

$$W = \begin{bmatrix} W_1 & w_{1,2} \\ w_{1,2}^{\top} & W_2 \end{bmatrix}, \qquad (W_k)_{mn} = \begin{cases} \alpha, & \text{joints } m,n \text{ of person } k \text{ are naturally connected} \\ \beta, & \text{joints } m,n \text{ of person } k \text{ form an external-dependency edge} \end{cases}$$

$$(w_{1,2})_{mn} = \gamma, \qquad \text{node } m \text{ of the first person and node } n \text{ of the second person form an interactive-dependency edge}$$

where $\alpha$, $\beta$, $\gamma$ respectively denote the weights corresponding to the internal dependency relationship, the external dependency relationship and the interactive dependency relationship;
in the step S40, "weights are respectively assigned to the edges of the three kinds of joint connection graphs corresponding to each frame of the skeleton sequence to be identified, to obtain the corresponding human joint connection graphs with different weight values", the method is as follows:
the weights are set to α = 3, β = 1, γ = 5, so that the internal connection relationship is emphasized over the additional external connection relationship, while the interactive connection relationship is highlighted most strongly.
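The weighted adjacency construction of claims S30 and S40 can be sketched in a few lines of numpy. This is a toy example, not the patent's implementation: it assumes N = 3 joints per person, and the three edge lists are hypothetical (the actual joint count and edge sets depend on the skeleton format captured).

```python
import numpy as np

# Toy two-person adjacency for claim 1: N = 3 joints per person, so W is 2N x 2N.
# The edge lists below are illustrative assumptions only.
N = 3
ALPHA, BETA, GAMMA = 3.0, 1.0, 5.0       # internal / external / interactive weights

internal_edges = [(0, 1), (1, 2)]        # natural bone connections within a person
external_edges = [(0, 2)]                # added long-range intra-person edge
interactive_edges = [(2, 2)]             # e.g. hand-to-hand contact across persons

W = np.zeros((2 * N, 2 * N))
for person in (0, 1):                    # identical intra-person blocks W_1 and W_2
    off = person * N
    for m, n in internal_edges:
        W[off + m, off + n] = W[off + n, off + m] = ALPHA
    for m, n in external_edges:
        W[off + m, off + n] = W[off + n, off + m] = BETA
for m, n in interactive_edges:           # cross-person blocks W_{1,2} and W_{2,1}
    W[m, N + n] = W[N + n, m] = GAMMA
```

Because every assignment is mirrored, W stays symmetric, matching an undirected joint connection graph.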
2. The human skeleton behavior recognition method based on graph convolution network of claim 1, characterized in that in the step S20, "the skeleton sequence to be identified is the skeleton sequence obtained after normalizing preset video frames of the acquired skeleton video", the method is as follows:
step S11, performing preset equidistant sampling on the obtained original skeleton video to serve as a training and recognition skeleton sequence;
step S12, performing rotation, translation and scale normalization on the joint point coordinates of each frame in the obtained skeleton sequence to obtain the skeleton sequence to be identified, specifically:
x̂_i^t = R^{−1}(x_i^t − o_R), i ∈ J, t ∈ T
wherein x_i^t is the i-th joint coordinate value of the originally acquired t-th frame, J and T represent the sets of joint points and acquired frames, and x̂_i^t is the processed coordinate value;
the rotation matrix R and the rotation origin o_R are defined as follows:
R = [ (v_2 − proj_{v1}(v_2)) / ‖v_2 − proj_{v1}(v_2)‖, v_1 / ‖v_1‖, (v_1 × v_2) / ‖v_1 × v_2‖ ]
o_R = (x_lhip + x_rhip) / 2
wherein v_1 and v_2 are the vector perpendicular to the ground and the difference vector between the left and right hip joints of the initial skeleton in each sequence, proj_{v1}(v_2) and v_1 × v_2 respectively represent the projection of v_2 onto v_1 and the outer product of these two vectors, and x_lhip and x_rhip represent the coordinates of the left and right hip joints of the initial skeleton of each sequence.
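The step S12 normalization can be sketched in numpy. Two details the claim leaves inside its formula images are assumed here: the rotation origin o_R is taken as the hip midpoint, and R's columns are the orthonormal frame built from v1 and v2 by Gram-Schmidt orthogonalization plus their cross product.

```python
import numpy as np

def normalize_skeleton(seq, lhip0, rhip0):
    """Rotation/translation normalization sketch for step S12.

    seq: (T, J, 3) joint coordinates; lhip0 / rhip0: left and right hip
    joints of the first frame of the sequence.  The rotation origin and
    the exact column order of R are assumptions, not the patent's spec.
    """
    v1 = np.array([0.0, 1.0, 0.0])            # vector perpendicular to the ground
    v2 = rhip0 - lhip0                        # left-to-right hip difference vector
    u1 = v1 / np.linalg.norm(v1)
    u2 = v2 - np.dot(v2, u1) * u1             # v2 minus proj_{v1}(v2)
    u2 = u2 / np.linalg.norm(u2)
    u3 = np.cross(u1, u2)                     # cross product of unit vectors: unit
    R = np.stack([u2, u1, u3], axis=1)        # columns = new body axes
    o_R = (lhip0 + rhip0) / 2.0               # assumed rotation origin (hip midpoint)
    return (seq - o_R) @ R                    # applies R^T to each (x - o_R)

lhip = np.array([-1.0, 0.0, 0.0])
rhip = np.array([1.0, 0.0, 0.0])
seq = np.stack([np.stack([lhip, rhip])])      # one frame, two joints
norm = normalize_skeleton(seq, lhip, rhip)    # hips land on the first body axis
```

In the toy call above the hip midpoint maps to the origin and the hip line is aligned with the first axis of the new frame, which is the intended effect of the normalization.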
3. The human skeleton behavior recognition method based on graph convolution network according to claim 1, wherein in the step S50, "the graph convolution operation is performed on the human body joint connection graphs with different weight values corresponding to each frame of the skeleton sequence to be identified, so as to obtain the spatial features of the skeleton sequence to be identified", the method is as follows:
given a T-frame video, the graph sequence [G_1, G_2, ..., G_T] is constructed, and the graph G_t constructed for each frame t is input into the graph convolution layer:
f_t = g_θ ∗ G_t
wherein ∗ represents the graph convolution operation and g_θ represents the graph convolution kernel, W being the weighted adjacency matrix of the human body joint connection graph;
the specific graph convolution kernel is calculated as follows:
the graph Laplacian is normalized over the spectral domain: L = I_n − D^{−1/2} W D^{−1/2}, wherein D is the diagonal degree matrix, D_ii = Σ_j w_ij; L is scaled to
L̃ = 2L/λ_max − I_n
wherein λ_max is the maximum eigenvalue of L and T_k is the Chebyshev polynomial, so that the convolution operation can be expressed as:
g_θ ∗ G_t = Σ_{k=0}^{K−1} η_k T_k(L̃) x_t
wherein η ∈ [η_0, η_1, ..., η_{K−1}] are training parameters and K is the size of the graph convolution kernel.
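The Chebyshev-polynomial graph convolution of claim 3 can be sketched in numpy. The η coefficients are passed in directly here (in the patent they are trained parameters), and the recurrence T_0 = I, T_1 = L̃, T_k = 2 L̃ T_{k−1} − T_{k−2} is the standard Chebyshev construction.

```python
import numpy as np

def chebyshev_gconv(W, x, eta):
    """W: (n, n) weighted adjacency, x: (n, c) node features,
    eta: (K,) filter coefficients eta_0 .. eta_{K-1}."""
    n = W.shape[0]
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))   # guard isolated nodes
    L = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    lam_max = np.max(np.linalg.eigvalsh(L))            # L is symmetric
    L_tilde = 2.0 * L / lam_max - np.eye(n)            # eigenvalues scaled to [-1, 1]
    T_prev, T_cur = np.eye(n), L_tilde                 # T_0 and T_1
    out = eta[0] * (T_prev @ x)
    for k in range(1, len(eta)):                       # sum eta_k * T_k(L_tilde) @ x
        out = out + eta[k] * (T_cur @ x)
        T_prev, T_cur = T_cur, 2.0 * L_tilde @ T_cur - T_prev
    return out

W_demo = np.array([[0.0, 1.0], [1.0, 0.0]])
x_demo = np.array([[1.0], [2.0]])
y_demo = chebyshev_gconv(W_demo, x_demo, np.array([1.0]))  # K = 1: identity filter
```

With K = 1 the filter reduces to η_0 · I, so the output equals the input features, which gives a quick sanity check on the recurrence.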
4. The human skeleton behavior recognition method based on graph convolution network according to claim 1, wherein in the step S60, "time sequence modeling is performed in the time dimension based on the spatial features of the skeleton sequence to be identified, to obtain the behavior category of the skeleton sequence to be identified", the method is as follows:
the spatial feature information f_t of each frame obtained by the graph convolution operation is expanded by a fully connected layer, then fed into a long short-term memory network for time sequence modeling, and classified by softmax to obtain the final interactive behavior recognition result.
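Claim 4's classification head (fully connected expansion, LSTM time-sequence modeling, softmax) can be sketched with a single-cell numpy LSTM. All weights here are random placeholders and the layer sizes are illustrative assumptions, not the patent's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify_sequence(feats, d_fc, d_h, n_classes):
    """feats: (T, d_in) flattened per-frame spatial features f_t."""
    T, d_in = feats.shape
    W_fc = rng.standard_normal((d_in, d_fc)) * 0.1         # fully connected layer
    W_g = rng.standard_normal((4, d_fc + d_h, d_h)) * 0.1  # i, f, g, o gate weights
    W_cls = rng.standard_normal((d_h, n_classes)) * 0.1    # softmax classifier
    h = np.zeros(d_h)
    c = np.zeros(d_h)
    for t in range(T):                                     # time sequence modeling
        x = feats[t] @ W_fc
        z = np.concatenate([x, h])
        i = sigmoid(z @ W_g[0])                            # input gate
        f = sigmoid(z @ W_g[1])                            # forget gate
        g = np.tanh(z @ W_g[2])                            # cell candidate
        o = sigmoid(z @ W_g[3])                            # output gate
        c = f * c + i * g
        h = o * np.tanh(c)
    return softmax(h @ W_cls)                              # class probabilities

probs = classify_sequence(rng.standard_normal((5, 12)), d_fc=8, d_h=6, n_classes=4)
```

The last hidden state is used for classification; biases are omitted for brevity.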
CN202010146319.0A 2020-03-05 2020-03-05 Human skeleton behavior recognition method based on graph convolution network Active CN111353447B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010146319.0A CN111353447B (en) 2020-03-05 2020-03-05 Human skeleton behavior recognition method based on graph convolution network


Publications (2)

Publication Number Publication Date
CN111353447A CN111353447A (en) 2020-06-30
CN111353447B true CN111353447B (en) 2023-07-04

Family

ID=71194272

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010146319.0A Active CN111353447B (en) 2020-03-05 2020-03-05 Human skeleton behavior recognition method based on graph convolution network

Country Status (1)

Country Link
CN (1) CN111353447B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329562B * 2020-10-23 2024-05-14 Jiangsu University Human interactive action recognition method based on skeleton characteristics and slicing recurrent neural network
CN112668550B * 2021-01-18 2023-12-19 Shenyang Aerospace University Double interaction behavior recognition method based on joint point-depth joint attention RGB modal data
CN113128425A * 2021-04-23 2021-07-16 Shanghai University of International Business and Economics Semantic self-adaptive graph network method for human action recognition based on skeleton sequence
CN113283400B * 2021-07-19 2021-11-12 Chengdu Koala Youran Technology Co., Ltd. Skeleton action identification method based on selective hypergraph convolutional network
CN113792712A * 2021-11-15 2021-12-14 Changsha Hisense Intelligent *** Research Institute Co., Ltd. Action recognition method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017133009A1 (en) * 2016-02-04 2017-08-10 广州新节奏智能科技有限公司 Method for positioning human joint using depth image of convolutional neural network
CN107301370A (en) * 2017-05-08 2017-10-27 上海大学 A kind of body action identification method based on Kinect three-dimensional framework models
CN110045823A (en) * 2019-03-12 2019-07-23 北京邮电大学 A kind of action director's method and apparatus based on motion capture
CN110197195A (en) * 2019-04-15 2019-09-03 深圳大学 A kind of novel deep layer network system and method towards Activity recognition

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9489570B2 (en) * 2013-12-31 2016-11-08 Konica Minolta Laboratory U.S.A., Inc. Method and system for emotion and behavior recognition
CN108304795B (en) * 2018-01-29 2020-05-12 清华大学 Human skeleton behavior identification method and device based on deep reinforcement learning
CA2995242A1 (en) * 2018-02-15 2019-08-15 Wrnch Inc. Method and system for activity classification
CN108764107B (en) * 2018-05-23 2020-09-11 中国科学院自动化研究所 Behavior and identity combined identification method and device based on human body skeleton sequence
CN108985259B (en) * 2018-08-03 2022-03-18 百度在线网络技术(北京)有限公司 Human body action recognition method and device
CN109376720B (en) * 2018-12-19 2022-01-18 杭州电子科技大学 Action classification method based on joint point space-time simple cycle network and attention mechanism
CN110222611B (en) * 2019-05-27 2021-03-02 中国科学院自动化研究所 Human skeleton behavior identification method, system and device based on graph convolution network

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition; Chenyang Si et al.; arXiv; full text *
Skeleton-Based Action Recognition with Multi-Stream Adaptive Graph Convolutional Networks; Lei Shi et al.; arXiv; full text *
Behavior recognition algorithm based on CNN and bidirectional LSTM; Wu Xiaoying et al.; Computer Engineering and Design (No. 02); full text *
Skeleton-based action recognition based on graph convolution; Dong An et al.; Modern Computer (No. 02); full text *
Two-person interactive behavior recognition based on holistic and individual segmentation fusion; Cao Jiangtao et al.; Journal of Liaoning Petrochemical University; Vol. 39 (No. 06); full text *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant