CN114613011A

CN114613011A - Human body 3D (three-dimensional) bone behavior identification method based on graph attention convolutional neural network

Info

Publication number: CN114613011A
Application number: CN202210265277.1A
Authority: CN
Inventors: 周树波; 陈冉冉; 蒋学芹; 潘峰; 杨义
Original assignee: Donghua University
Current assignee: Donghua University
Priority date: 2022-03-17
Filing date: 2022-03-17
Publication date: 2022-06-10

Abstract

The human body 3D bone behavior identification method based on a graph attention convolutional neural network comprises: preprocessing 3D bone data with a batch normalization layer to obtain normalized data; feeding the normalized data to a multi-head graph attention module to obtain a graph attention matrix; using the graph attention matrix as an adjacency-like matrix in the spatial-domain graph convolution to extract high-order spatial features; applying a time-domain graph convolution to the output features of the spatial-domain graph attention convolution to obtain an output result; and passing the output result through a fully connected layer and global average pooling into a softmax classifier to obtain the final prediction result. Because a graph attention mechanism is introduced to obtain a graph attention matrix when spatial-domain features are captured, the method can learn a distinct graph topology for each action from the 3D bone data alone, without being limited to the a priori human body topology. This improves the recognition performance of the network and increases its flexibility and generalization capability.

Description

Human body 3D (three-dimensional) bone behavior identification method based on graph attention convolutional neural network

Technical Field

The invention relates to a human body 3D bone behavior identification method based on a graph attention convolutional network, in particular to a deep learning method for recognizing human behavior from 3D bone data, and belongs to the field of computer vision.

Background

Human behavior recognition is a widely studied topic in computer vision. The analysis and understanding of human actions has broad practical value in fields such as intelligent surveillance, human-computer interaction and virtual reality.

With the rapid development of computing and the continuous advance of video capture technology, many methods have begun to focus on behavior recognition from 3D bone data rather than from videos captured by conventional 2D cameras. In 3D bone behavior recognition, deep learning is currently the dominant approach. Deep learning methods fall into three main families: those based on convolutional neural networks, those based on recurrent neural networks, and those based on graph convolutional neural networks; the graph convolutional direction is the current research focus.

Existing graph-convolution-based 3D bone behavior identification methods extract and fuse spatio-temporal information, but they consider only information propagation among nodes in a first-order neighborhood and ignore the influence of global nodes. As a result, behaviors with complex spatio-temporal relationships cannot be represented effectively. Moreover, different actions involve different skeletal key points, and key-point importance determined solely by the connections between skeletal key points does not describe the topology of the action graph well. The human body 3D bone behavior identification method therefore needs further improvement.

Disclosure of Invention

The invention addresses the problem that existing graph convolution identification methods depend on prior knowledge of the human body structure and cannot effectively describe the topology of the human body 3D skeleton graph, and provides a human body 3D bone behavior identification method based on a graph attention convolutional neural network. The scheme builds on graph convolution and introduces a graph attention matrix to achieve feature enhancement, so that 3D bone behaviors can be identified with improved accuracy.

The technical scheme of the invention is a human body 3D bone behavior identification method based on a graph attention convolution neural network, which comprises the following steps:

Step 1: inputting the 3D bone joint point data into a batch normalization layer, normalizing the data, and outputting normalized bone data $f_{in}$;

Step 2: inputting the normalized data $f_{in}$ into a graph attention module to obtain an attention matrix $T$, whose elements reflect how critical different joint points are in different actions; the attention matrix is calculated in the following way:

$$T_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a^{T}[W v_{ti} \,\|\, W v_{tj}]\right)\right)}{\sum_{e_{tp} \in D(e_{ti})} \exp\left(\mathrm{LeakyReLU}\left(a^{T}[W v_{ti} \,\|\, W v_{tp}]\right)\right)}$$

wherein $T_{ij}$ is the value of the element in the $i$-th row and $j$-th column of the matrix $T$; $a$ denotes the attention mapping function and is a row vector, and $a^{T}$ is its transpose; $W$ is a weight matrix; $[\cdot \,\|\, \cdot]$ denotes concatenation of the features of two nodes; $v_{ti}$ is the feature of the $i$-th node at the $t$-th time frame; $v_{tj}$ is the feature of the $j$-th node at the $t$-th time frame, the $j$-th node being a neighbor of the $i$-th node; $e_{ti}$ is the $i$-th node at the $t$-th time frame; $D(e_{ti})$ is the neighborhood of node $e_{ti}$; and the feature at the $t$-th time frame of the $p$-th node in the neighborhood $D(e_{ti})$ is denoted $v_{tp}$; the graph attention module is extended to a multi-head mechanism, that is, several attention modules are used and the mean of their graph attention matrices is taken as the final graph attention matrix used in the computation;

Step 3: adding the attention matrix $T$ computed under the multi-head attention mechanism to the adjacency matrix $A$ and the data-driven matrix $B$ under the neighborhoods divided according to the spatial configuration, taking the sum as the feature extraction matrix, and extracting high-level spatial features with the spatial graph convolution formula; the spatial graph convolution is followed by a batch normalization layer and a ReLU nonlinear activation layer;

Step 4: taking the output of the spatial-domain graph convolution as the input of the time-domain graph convolution, and performing feature extraction and fusion in the time domain, the convolution operation in the time-domain graph convolution being a two-dimensional convolution; the time-domain graph convolution is likewise followed by a batch normalization layer and a ReLU nonlinear activation layer;

Step 5: finally, after a fully connected layer and a global average pooling layer, performing classification prediction on the output features with a softmax classifier to obtain the prediction result for the bone joint point data.

The invention uses a graph attention mechanism to generate a graph attention matrix. Different weight coefficients can be learned according to how important different joints are to different actions; each such coefficient, called a normalized attention coefficient, is an element of the attention matrix. Adding this attention matrix to a conventional graph convolution network lets the network better learn the relations between skeletal joint points, which enhances the features and improves identification accuracy.

Drawings

FIG. 1 is a structural diagram of a human body 3D bone behavior recognition method based on a graph attention convolutional neural network according to the present invention.

FIG. 2 is a block diagram of the multi-head graph attention module that generates the graph attention matrix.

FIG. 3 is a block diagram of spatial-domain feature extraction and fusion in the graph attention convolutional network.

Detailed description of the preferred embodiments

The following detailed description of embodiments of the invention refers to the accompanying drawings.

Step 1: The raw 3D bone joint point data $f$ is input into a batch normalization (BN) layer. The BN layer normalizes the features of the 3D bone joint point data along the three coordinate dimensions to obtain features with a mean of 0 and a variance of 1:

$$\hat{f} = \frac{f - \mu}{\sqrt{\sigma^{2} + \epsilon}}$$

where $\mu$ and $\sigma^{2}$ are the batch mean and variance and $\epsilon$ is a small constant for numerical stability. The data $\hat{f}$ is then transformed and reconstructed with two parameters $\gamma$ and $\beta$:

$$f_{in} = \gamma \hat{f} + \beta$$

The normalized bone data $f_{in}$ is then output; the parameters $\gamma$ and $\beta$ are optimized by gradient descent during training.
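The normalization in Step 1 can be illustrated with a minimal NumPy sketch. It assumes an input of shape (batch, C, T, N) with statistics taken per coordinate channel; the function name `batch_norm`, the constant `eps`, and the sample sizes are illustrative choices, not values given in the text.

```python
import numpy as np

def batch_norm(f, gamma, beta, eps=1e-5):
    """Normalize each coordinate channel, then scale and shift.

    f has shape (batch, C, T, N): C coordinate channels, T frames, N joints.
    gamma and beta have shape (C,). eps is an assumed stability constant.
    """
    # Statistics are taken over every axis except the channel axis.
    mean = f.mean(axis=(0, 2, 3), keepdims=True)
    var = f.var(axis=(0, 2, 3), keepdims=True)
    f_hat = (f - mean) / np.sqrt(var + eps)
    # Learnable reconstruction with gamma (scale) and beta (shift).
    return gamma.reshape(1, -1, 1, 1) * f_hat + beta.reshape(1, -1, 1, 1)

# Hypothetical sample: 8 sequences, 3 coordinates, 20 frames, 25 joints.
rng = np.random.default_rng(0)
f = rng.normal(loc=2.0, scale=3.0, size=(8, 3, 20, 25))
f_in = batch_norm(f, gamma=np.ones(3), beta=np.zeros(3))
```

With `gamma` set to ones and `beta` to zeros, each channel of `f_in` has approximately zero mean and unit variance; during training the two parameters would be updated by gradient descent, which this sketch does not cover.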

Step 2: First, the normalized data $f_{in}$ is multiplied by a weight matrix $W$ as a linear transformation. The graph attention mapping function $a$ then maps the transformed data to a real number, called the attention coefficient, which is normalized with a softmax function to give the normalized attention coefficient. Assuming the skeleton graph has $N$ nodes, a normalized attention coefficient reflecting the degree of association can be computed between any two nodes, and these coefficients together form an $N \times N$ matrix called the graph attention matrix $T$. To improve the fitting capability of the model, a multi-head attention mechanism is used: several attention mapping functions generate several graph attention matrices, which are averaged to obtain the final graph attention matrix.
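A minimal NumPy sketch of this computation for a single frame is given here for illustration. It assumes the usual graph attention form, in which the coefficient is a LeakyReLU of the attention vector applied to the concatenated transformed features; the helper names, the LeakyReLU slope of 0.2, and all sizes are assumptions rather than values from the text.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    # slope is an assumed hyperparameter.
    return np.where(x > 0, x, slope * x)

def attention_matrix(v, W, a, neighbors):
    """One attention head for a single frame.

    v: (N, C) joint features, W: (C, F) weight matrix, a: (2F,) attention vector,
    neighbors: (N, N) boolean mask, True where node j is in the neighborhood of i.
    Every row of the mask needs at least one True entry.
    """
    h = v @ W                                    # linear transformation, (N, F)
    F = h.shape[1]
    # a^T [h_i || h_j] splits into a[:F].h_i + a[F:].h_j.
    e = leaky_relu((h @ a[:F])[:, None] + (h @ a[F:])[None, :])
    e = np.where(neighbors, e, -np.inf)          # exclude non-neighbors from the softmax
    e = e - e.max(axis=1, keepdims=True)         # numerical stability
    w = np.exp(e)
    return w / w.sum(axis=1, keepdims=True)      # row-wise softmax

def multi_head_attention(v, Ws, a_s, neighbors):
    """Average the attention matrices produced by several heads."""
    heads = [attention_matrix(v, W, a, neighbors) for W, a in zip(Ws, a_s)]
    return np.mean(heads, axis=0)

# Hypothetical sizes: 5 joints, 3 coordinates, 4 hidden features, 3 heads.
rng = np.random.default_rng(0)
N, C, F, n_heads = 5, 3, 4, 3
v = rng.normal(size=(N, C))
Ws = [rng.normal(size=(C, F)) for _ in range(n_heads)]
a_s = [rng.normal(size=2 * F) for _ in range(n_heads)]
T_att = multi_head_attention(v, Ws, a_s, np.ones((N, N), dtype=bool))
```

The all-True mask corresponds to computing a coefficient between any two nodes; passing a sparser mask restricts each row to a chosen neighborhood. Each row of `T_att` sums to 1 because an average of row-stochastic matrices is itself row-stochastic.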

Step 3: The graph attention matrix $T$ obtained under the multi-head attention mechanism is added to the graph adjacency matrix $A$ and the data-driven matrix $B$ under the neighborhoods divided according to the spatial configuration, and the sum is used as the feature extraction matrix. The graph adjacency matrix $A \in \mathbb{R}^{N \times N}$ has elements 0 and 1 (a value of 1 indicates that the two nodes at that row and column are connected), and the data-driven matrix $B \in \mathbb{R}^{N \times N}$ is randomly initialized before training and optimized during training. The first-order neighborhood of each node is divided into three parts by the neighborhood partition strategy based on spatial configuration, and the spatial graph convolution formula

$$f_{sout} = \sum_{k=1}^{K_v} W_k f_{in} (A_k + B_k + C_k)$$

is used for spatial feature extraction, where $f_{sout}$ is the output of the spatial-domain graph convolution, $W_k$ is a weight matrix optimized during training, $K_v$ is the number of neighborhoods into which the joint points are divided according to the spatial configuration, generally 3, and $A_k$, $B_k$ and $C_k$ denote the graph adjacency matrix, the data-driven matrix and the graph attention matrix for each of the three neighborhoods.
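As an illustration of Step 3, the sketch below sums the adjacency, data-driven and attention matrices for each neighborhood subset, aggregates joint features with that sum, mixes channels with a per-subset weight matrix, and adds the results over the subsets. The tensor layout (C, T, N), the function name `spatial_graph_conv`, and the random stand-in matrices are assumptions; the batch normalization that follows the convolution is omitted.

```python
import numpy as np

def spatial_graph_conv(f_in, W, A, B, C):
    """Spatial graph convolution over K neighborhood subsets.

    f_in: (C_in, T, N) features; W: (K, C_out, C_in) weight matrices;
    A, B, C: (K, N, N) adjacency, data-driven and attention matrices.
    Returns features of shape (C_out, T, N).
    """
    M = A + B + C  # feature extraction matrix for each subset
    # For each subset k: aggregate over joints with M[k], mix channels with W[k];
    # the einsum also sums the K partial results.
    return np.einsum("koc,ctn,knm->otm", W, f_in, M)

# Hypothetical sizes: 3 subsets, 3 input channels, 8 output channels, 10 frames, 5 joints.
rng = np.random.default_rng(0)
K, C_in, C_out, T, N = 3, 3, 8, 10, 5
f_in = rng.normal(size=(C_in, T, N))
W = rng.normal(size=(K, C_out, C_in))
A = rng.integers(0, 2, size=(K, N, N)).astype(float)   # stand-in 0/1 adjacency
B = rng.normal(scale=0.01, size=(K, N, N))              # randomly initialized, trainable
C_att = rng.random(size=(K, N, N))                      # stand-in attention matrices
f_sout = np.maximum(spatial_graph_conv(f_in, W, A, B, C_att), 0.0)  # ReLU
```

Writing the sum over subsets as a single `einsum` keeps the three matrices interchangeable: dropping the attention term reduces the sketch to a plain adaptive graph convolution.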

Step 4: The output of the spatial-domain graph convolution is taken as the input of the time-domain graph convolution, and features are extracted and fused in the time domain. The time-domain graph convolution is mainly implemented by a single 2D convolution with kernel size $K_t \times 1$, which extracts features along the time dimension only, and the result is output.
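A convolution whose kernel spans several frames but only one joint can be sketched in NumPy as follows. Stride 1, zero padding that preserves the number of frames, an odd kernel length, and the name `temporal_conv` are assumptions made for this illustration.

```python
import numpy as np

def temporal_conv(x, kernel, bias=None):
    """Convolution over the time axis only (kernel of size K_t x 1).

    x: (C_in, T, N); kernel: (C_out, C_in, K_t) with odd K_t.
    Assumes stride 1 and zero padding so the output keeps T frames.
    """
    c_out, c_in, k_t = kernel.shape
    pad = (k_t - 1) // 2
    T = x.shape[1]
    xp = np.pad(x, ((0, 0), (pad, pad), (0, 0)))   # pad the time axis only
    out = np.zeros((c_out, T, x.shape[2]))
    for d in range(k_t):
        # Each kernel tap mixes the channels of the frame at offset d;
        # features of different joints are never mixed.
        out += np.einsum("oc,ctn->otn", kernel[:, :, d], xp[:, d:d + T, :])
    if bias is not None:
        out += bias[:, None, None]
    return out

# Toy input: 1 channel, 5 frames, 2 joints, and a 3-frame summing kernel.
x = np.arange(10.0).reshape(1, 5, 2)
y = temporal_conv(x, np.ones((1, 1, 3)))
```

In the toy call each output value is the sum of a joint's values over the previous, current and next frame, with zeros beyond the sequence ends.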

Step 5: A fully connected layer maps the feature dimension to the number of action classes, a global average pooling layer reduces the number of model parameters and thereby limits overfitting, and a softmax function then normalizes the output of the global average pooling layer to give a prediction score for each action class; the action class with the highest prediction score is taken as the final prediction result.
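The classification head of Step 5 can be sketched as below, following the order stated in the text: fully connected layer, then global average pooling, then softmax. Applying the fully connected layer at every (frame, joint) position is an assumption about how the layer is wired, and the function name and sizes are hypothetical.

```python
import numpy as np

def classify(features, W_fc, b_fc):
    """Fully connected layer, global average pooling, then softmax.

    features: (C, T, N); W_fc: (num_classes, C); b_fc: (num_classes,).
    Returns the per-class scores and the index of the best class.
    """
    # Fully connected layer applied at every (frame, joint) position.
    logits = np.einsum("kc,ctn->ktn", W_fc, features) + b_fc[:, None, None]
    pooled = logits.mean(axis=(1, 2))            # global average pooling
    z = pooled - pooled.max()                    # numerical stability
    scores = np.exp(z) / np.exp(z).sum()         # softmax
    return scores, int(np.argmax(scores))

# Hypothetical sizes: 256 channels, 10 frames, 5 joints, 6 action classes.
rng = np.random.default_rng(0)
features = rng.normal(size=(256, 10, 5))
W_fc = rng.normal(scale=0.1, size=(6, 256))
b_fc = np.zeros(6)
scores, predicted_class = classify(features, W_fc, b_fc)
```

Because the fully connected layer and the pooling are both linear, swapping their order would give the same scores; pooling first would simply be cheaper to compute.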

An embodiment of the invention implements the graph attention convolutional neural network as follows:

Human 3D bone data is a time series of the joint points of a human skeleton. It is essentially a spatio-temporal graph and can be represented as a graph structure $G = (V, E)$. Here $V = \{e_{ti} \mid t = 1, \ldots, T;\ i = 1, \ldots, N\}$ is the set of nodes, $v_{ti}$ is the feature vector of the $i$-th joint point at time frame $t$ in the spatio-temporal graph, $T$ is the number of time frames, and $N$ is the number of bone joint points in a single frame; the feature vector of a joint point is its three-dimensional coordinate in space, $v_{ti} = (x_{ti}, y_{ti}, z_{ti})$. $E$ is the set of edges connecting the nodes and can be divided into two subsets. One subset is $E_S = \{v_{ti} v_{tj} \mid (i, j) \in H\}$, where node $j$ is a neighbor of node $i$ and $H$ is the set of naturally connected human joints. The other subset is $E_F = \{v_{ti} v_{(t+1)i}\}$, which represents the connections of the same node across consecutive frames.

Generally, the input dimension of the bone data is $(C, T, N)$, where $C$ is the feature dimension, $T$ is the number of time frames in a bone sequence, and $N$ is the number of human bone joints. First, the bone data $f$ is normalized along the feature dimension by a BN layer to obtain normalized data with a mean of 0 and a variance of 1:

$$\hat{f} = \frac{f - \mu}{\sqrt{\sigma^{2} + \epsilon}}$$

where $\mu$ and $\sigma^{2}$ are the batch mean and variance and $\epsilon$ is a small constant for numerical stability. The data $\hat{f}$ is then transformed and reconstructed with two parameters $\gamma$ and $\beta$:

$$f_{in} = \gamma \hat{f} + \beta$$

The normalized bone data $f_{in}$ is then output; the parameters $\gamma$ and $\beta$ are optimized by gradient descent during training. The main role of this step is to unify the joint coordinates at different time frames.
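The graph structure described in this embodiment, with one edge subset for naturally connected joints within a frame and another linking each joint to itself in the next frame, can be made concrete with a short sketch. The node encoding as `(t, i)` tuples with 0-based indices, the function name `build_edges`, and the five-joint toy skeleton are hypothetical.

```python
def build_edges(H, T, N):
    """Build the node set and the two edge subsets of a spatio-temporal skeleton graph.

    H: iterable of (i, j) pairs of naturally connected joints (0-based indices),
    T: number of frames, N: number of joints per frame.
    Nodes are encoded as (t, i) tuples.
    """
    nodes = [(t, i) for t in range(T) for i in range(N)]
    # E_S: edges between naturally connected joints within the same frame.
    E_S = [((t, i), (t, j)) for t in range(T) for (i, j) in H]
    # E_F: edges linking the same joint in consecutive frames.
    E_F = [((t, i), (t + 1, i)) for t in range(T - 1) for i in range(N)]
    return nodes, E_S, E_F

# Hypothetical toy skeleton: joint 1 is connected to joints 0, 2, 3 and 4.
H = [(0, 1), (1, 2), (1, 3), (1, 4)]
nodes, E_S, E_F = build_edges(H, T=3, N=5)
```

For this toy input the graph has 15 nodes, 12 spatial edges (4 per frame) and 10 temporal edges (5 per pair of consecutive frames).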

After the normalized joint coordinate data is obtained, it is sent to the multiple graph attention modules shown in FIG. 2 to compute the attention matrix under the multi-head attention mechanism. Computing the attention matrix takes three steps. In the first step, the attention coefficient $e_{ij}$ is computed as $e_{ij} = \mathrm{LeakyReLU}(a^{T}[W v_{ti} \,\|\, W v_{tj}])$, where $a$ is the attention mapping function, $W$ is the weight matrix, and $[\cdot \,\|\, \cdot]$ denotes concatenation of the features of two nodes. In the second step, the attention coefficient is normalized with a softmax function over the neighborhood $D(e_{ti})$ of the $i$-th node to obtain

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{e_{tp} \in D(e_{ti})} \exp(e_{ip})}$$

as an element of the attention matrix $T$, $\alpha_{ij}$ being the element in the $i$-th row and $j$-th column of the attention matrix. In the third step, a multi-head attention mechanism generates several graph attention matrices, a process carried out by several graph attention networks: the normalized input data is fed to several graph attention modules, each of which computes a graph attention matrix, and these matrices are averaged to obtain the final attention matrix used in the graph convolution operation.

After the attention matrix is obtained, the spatial-domain graph convolution is performed according to the neighborhood partition strategy based on spatial configuration, using the formula

$$f_{sout} = \sum_{k=1}^{K_v} W_k f_{in} (A_k + B_k + C_k)$$

where $f_{sout}$ is the output of the spatial graph convolution, $W_k$ is a weight matrix optimized during training, $K_v$ is the number of neighborhoods into which the joint points are divided according to the spatial configuration, generally 3, and $A_k$, $B_k$ and $C_k$ denote the graph adjacency matrix, the data-driven matrix and the graph attention matrix for each of the three neighborhoods; the specific operation of the spatial graph convolution is shown in FIG. 3. The spatial-domain graph convolution module comprises a spatial-domain graph convolution operation, a BN layer, and a ReLU for nonlinear activation. The output of the spatial-domain graph convolution module is taken as the input of the time-domain graph convolution module to extract time-domain features. The time-domain graph convolution module is mainly a two-dimensional convolution operation, which is likewise followed by a BN layer and a ReLU layer.
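The text does not spell out how the three neighborhoods are formed. A common convention for spatial-configuration partitioning, assumed here, splits the first-order neighborhood of each root joint into the root itself (together with any neighbor at equal distance from a chosen center joint), neighbors closer to the center, and neighbors farther from it. The sketch below builds the three 0/1 adjacency matrices under that assumption; the function names, the choice of center joint, and the toy skeleton are hypothetical, and normalization of the matrices is omitted.

```python
import numpy as np
from collections import deque

def hops_to(center, N, H):
    """Breadth-first hop distance from every joint to the center joint.

    Assumes the skeleton graph defined by H is connected.
    """
    adj = {i: set() for i in range(N)}
    for i, j in H:
        adj[i].add(j)
        adj[j].add(i)
    dist = {center: 0}
    queue = deque([center])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def partition_adjacency(N, H, center):
    """Return A of shape (3, N, N); A[k, r, n] = 1 if neighbor n of root r is in subset k."""
    d = hops_to(center, N, H)
    A = np.zeros((3, N, N))
    for i in range(N):
        A[0, i, i] = 1.0                      # the root joint itself
    for i, j in H:
        for r, n in ((i, j), (j, i)):         # r = root, n = its neighbor
            if d[n] == d[r]:
                k = 0                         # same distance to the center as the root
            elif d[n] < d[r]:
                k = 1                         # neighbor closer to the center
            else:
                k = 2                         # neighbor farther from the center
            A[k, r, n] = 1.0
    return A

# Hypothetical toy skeleton with joint 1 as the center.
H = [(0, 1), (1, 2), (1, 3), (1, 4)]
A_parts = partition_adjacency(N=5, H=H, center=1)
```

Summing `A_parts` over its first axis recovers the ordinary adjacency matrix with self-loops, so the partition only redistributes existing connections among the three subsets.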

The spatial-domain graph convolution module and the time-domain graph convolution module together form a spatio-temporal graph attention convolution module. As shown in FIG. 1, the entire graph attention convolution network consists of ten such modules. The number of output channels of the first three modules is 64, the number of output channels of the middle four modules is 128, and the number of output channels of the last four modules is 256. After the ten graph attention convolution modules, a fully connected layer, a global average pooling layer and a softmax function serve as the classifier for behavior prediction. Performing the above operations yields the prediction result for the input joint data.

The above embodiment explains the implementation of the graph attention convolutional neural network in detail. Compared with the conventional spatio-temporal graph convolutional neural network, the method uses the same time-domain convolution module to capture time-domain features. When spatial-domain features are captured, however, a graph attention mechanism is introduced to obtain a graph attention matrix, so that a distinct graph topology can be learned for each action from the 3D bone data alone, without being limited to the a priori human body topology. The method not only improves the identification performance of the network but also increases its flexibility and generalization capability to a certain extent.

The present invention is not limited to the above examples, and any modification or variation made within the scope of the claims is within the scope of the present invention.

Claims

1. A human body 3D bone behavior identification method based on a graph attention convolutional neural network comprises the following steps:

Step 2: inputting the normalized data $f_{in}$ into a graph attention module to obtain an attention matrix $T$, whose elements reflect how critical different joint points are in different actions; the attention matrix is calculated in the following way:

$$T_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a^{T}[W v_{ti} \,\|\, W v_{tj}]\right)\right)}{\sum_{e_{tp} \in D(e_{ti})} \exp\left(\mathrm{LeakyReLU}\left(a^{T}[W v_{ti} \,\|\, W v_{tp}]\right)\right)}$$

wherein $T_{ij}$ is the value of the element in the $i$-th row and $j$-th column of the matrix $T$; $a$ denotes the attention mapping function and is a row vector, and $a^{T}$ is its transpose; $W$ is a weight matrix; $[\cdot \,\|\, \cdot]$ denotes concatenation of the features of two nodes; $v_{ti}$ is the feature of the $i$-th node at the $t$-th time frame; $v_{tj}$ is the feature of the $j$-th node at the $t$-th time frame, the $j$-th node being a neighbor of the $i$-th node; $e_{ti}$ is the $i$-th node at the $t$-th time frame; $D(e_{ti})$ is the neighborhood of node $e_{ti}$; and the feature at the $t$-th time frame of the $p$-th node in the neighborhood $D(e_{ti})$ is denoted $v_{tp}$; the graph attention module is extended to a multi-head mechanism, that is, several attention modules are used and the mean of their graph attention matrices is taken as the final graph attention matrix used in the computation;

2. The human body 3D bone behavior recognition method according to claim 1, wherein the specific process of the step 1 is as follows: inputting original 3D bone joint point data $f$ into a batch normalization layer, wherein the batch normalization layer normalizes the features of the 3D bone joint point data along the three coordinate dimensions to obtain features with a mean of 0 and a variance of 1:

$$\hat{f} = \frac{f - \mu}{\sqrt{\sigma^{2} + \epsilon}}$$

where $\mu$ and $\sigma^{2}$ are the batch mean and variance and $\epsilon$ is a small constant for numerical stability; the data $\hat{f}$ is then transformed and reconstructed with two parameters $\gamma$ and $\beta$:

$$f_{in} = \gamma \hat{f} + \beta$$

3. The human body 3D bone behavior recognition method according to claim 2, wherein the specific process of the step 2 is as follows: first, the normalized data $f_{in}$ is multiplied by a weight matrix $W$ as a linear transformation, the transformed data is mapped to an attention coefficient by a graph attention mapping function $a$, and the attention coefficient is then normalized with a softmax function to obtain a normalized attention coefficient; assuming that the skeleton graph has $N$ nodes, a normalized attention coefficient reflecting the degree of association is calculated between any two nodes, and the normalized attention coefficients form an $N \times N$ matrix called the graph attention matrix $T$; to improve the fitting capability of the model, a multi-head attention mechanism is used, that is, several attention mapping functions generate several graph attention matrices, which are averaged to obtain the final graph attention matrix.

4. The human body 3D bone behavior recognition method according to claim 3, wherein the specific process of the step 3 is as follows: the graph attention matrix $T$ obtained under the multi-head attention mechanism is added to the graph adjacency matrix $A$ and the data-driven matrix $B$ under the neighborhoods divided according to the spatial configuration, and the sum is used as the feature extraction matrix, wherein the graph adjacency matrix $A \in \mathbb{R}^{N \times N}$ has elements 0 and 1, a value of 1 indicating that the two nodes at that row and column are connected, and the data-driven matrix $B \in \mathbb{R}^{N \times N}$ is randomly initialized before training and optimized during training; the first-order neighborhood of each node is divided into three parts by the neighborhood partition strategy based on spatial configuration, and the spatial graph convolution formula

$$f_{sout} = \sum_{k=1}^{K_v} W_k f_{in} (A_k + B_k + C_k)$$

is used for spatial feature extraction, where $f_{sout}$ is the output of the spatial graph convolution, $W_k$ is a weight matrix optimized during training, $K_v$ is the number of neighborhoods into which the joint points are divided according to the spatial configuration, and $A_k$, $B_k$ and $C_k$ denote the graph adjacency matrix, the data-driven matrix and the graph attention matrix for each of the three neighborhoods.

5. The human body 3D bone behavior recognition method according to claim 4, wherein the specific process of the step 4 is as follows: taking the output of the spatial-domain graph convolution as the input of the time-domain graph convolution, and extracting and fusing features in the time domain; the time-domain graph convolution is implemented by a single 2D convolution with kernel size $K_t \times 1$, which extracts features along the time dimension only, and the result is output.

6. The human body 3D bone behavior recognition method according to claim 5, wherein the specific process of the step 5 is as follows: a fully connected layer maps the feature dimension to the number of action classes, a global average pooling layer reduces the number of model parameters and thereby limits overfitting, and a softmax function then normalizes the output of the global average pooling layer to obtain a prediction score for each action class, wherein the action class with the highest prediction score is taken as the final prediction result.