CN113158970A - Action identification method and system based on fast and slow dual-flow graph convolutional neural network - Google Patents

Action identification method and system based on fast and slow dual-flow graph convolutional neural network

Info

Publication number
CN113158970A
CN113158970A (application CN202110510781.9A)
Authority
CN
China
Prior art keywords
fast
slow
features
branch
human body
Prior art date
Legal status
Granted
Application number
CN202110510781.9A
Other languages
Chinese (zh)
Other versions
CN113158970B (en)
Inventor
高跃 (Gao Yue)
陈自强 (Chen Ziqiang)
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202110510781.9A
Publication of CN113158970A
Application granted
Publication of CN113158970B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an action recognition method and system based on a fast and slow dual-flow graph convolutional neural network, wherein the method comprises the following steps: acquiring human skeleton joint features; regularizing the human skeleton joint features and reshaping the feature tensor of one batch; duplicating the processed human skeleton joint features to obtain two identical copies, which are input to the fast branch and the slow branch of the fast-slow dual-flow graph convolutional network, respectively, for feature learning; and eliminating the extra dimensions of the features of each action category through a global pooling layer, mapping the pooled features to the corresponding action categories through a fully connected layer, and obtaining the score of each action category through a Softmax function. The method addresses the weak temporal modeling of the prior art and better captures temporal information and fast and slow motion information.

Description

Action identification method and system based on fast and slow dual-flow graph convolutional neural network
Technical Field
The invention relates to the technical field of action recognition based on skeleton information, and in particular to an action recognition method and system based on a fast and slow dual-flow graph convolutional neural network.
Background
In action recognition based on skeleton information, methods based on graph convolutional neural networks are currently the mainstream. However, the graph convolutional neural network was designed for feature extraction on a single static graph structure and is weak at extracting temporal information. Human skeleton information is time-series, continuous graph-structured data, and can also be regarded as dynamic graph data. For the action recognition task, capturing only the spatial structure information of the static graph (single-frame skeleton information) while ignoring temporal information cannot achieve satisfactory performance. In general, for actions that can be distinguished from a single static frame, graph-convolution-based methods achieve good performance; but some actions whose static frames resemble those of other actions can only be distinguished with additional temporal motion information, which requires the model to have stronger temporal modeling capability.
Many current graph-convolution-based methods focus their design on capturing spatial structure information, improving model performance by defining adaptive adjacency matrices, new graph structure modeling methods, new node connections, and the like. Compared with ST-GCN, which first applied GCNs to human skeleton action recognition, these methods achieve a certain performance improvement. For temporal modeling, however, they simply follow the two-dimensional convolution used by ST-GCN, with little improvement.
In RGB-video-based methods, modeling temporal information and joint spatio-temporal information has long been an important topic; researchers use optical flow modalities to model motion information, or use 3D convolutional networks to model temporal and spatial information jointly. In recent years, the convolutional-neural-network-based method SlowFast has been highly successful in RGB-video-based action recognition.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, one object of the invention is to propose an action recognition method based on a fast and slow dual-flow graph convolutional neural network, which builds on graph convolutional methods and uses a fast-slow dual-flow graph convolutional network to better capture temporal information and fast and slow motion information, thereby improving the accuracy of action recognition.
The second purpose of the invention is to provide a motion recognition system based on the fast and slow dual-flow graph convolutional neural network.
A third object of the invention is to propose a computer device.
A fourth object of the invention is to propose a non-transitory computer-readable storage medium.
In order to achieve the above object, an embodiment of a first aspect of the present invention provides an action recognition method based on a fast and slow dual-flow graph convolutional neural network, including the following steps:
step S10, acquiring human skeleton joint features;
step S20, regularizing the human skeleton joint features, wherein the feature tensor of one batch is reshaped, a one-dimensional batch regularization module is applied along the temporal dimension, and the features are then reshaped back to their original shape;
step S30, duplicating the human skeleton joint features processed in step S20 to obtain two identical copies, inputting the two copies to the fast branch and the slow branch of the fast-slow dual-flow graph convolutional network, respectively, for feature learning, and fusing the learning results of the fast branch and the slow branch to obtain the features of each action category, wherein the fast branch and the slow branch of the fast-slow dual-flow graph convolutional network have the same network structure but different network parameter configurations and input features;
and step S40, eliminating the extra dimensions of the features of each action category through a global pooling layer, mapping the pooled features to the corresponding action categories through a fully connected layer, and obtaining the score of each action category through a Softmax function.
Optionally, in an embodiment of the present application, the step S10 includes the following steps:
human skeleton joint features are obtained from the data set, and the feature shape of each sample is:
(C,T,M,V)
where C is the number of feature channels, equal to 3 and representing the three-dimensional coordinates (x, y, z) of the joint points; T is the number of frames of the action; M is the number of persons performing the action; and V is the number of human joint points.
Optionally, in an embodiment of the present application, the step S20 includes the following steps:
the data is regularized; batch training is used in the training process, and the feature shape of one batch tensor is:
(B,C,T,M,V)
first, the batch tensor is reshaped to:
(B,M*V*C,T)
then, a one-dimensional batch regularization module is applied to the temporal dimension T, and the features are reshaped back to the original shape (B, C, T, M, V).
Optionally, in an embodiment of the present application, the specific steps in step S30 include:
each branch comprises a plurality of consecutively stacked graph convolution blocks, and each graph convolution block comprises a spatial graph convolution layer and a temporal convolution layer; the temporal convolution layer is a two-dimensional convolution module with kernel size (t, 1), where t is the temporal receptive field of the convolution kernel; each of the two convolution layers is followed by a batch regularization layer and a ReLU activation function, ensuring that the features of each channel keep the same distribution; the computation of the graph convolution block is described by the following formula:
f_out = Σ_{k=1..K_v} W_k f_in (A_k + B_k + C_k)
where A_k is the predefined adjacency matrix; B_k and C_k are the adaptive adjacency matrices proposed in 2s-AGCN, which change during network training: B_k is initialized to A_k and learns the potential association of any two nodes, and C_k is a matrix computed from the sample features that describes the sample-specific node associations.
Optionally, in an embodiment of the present application, the following two formulas respectively describe the feature shapes of the input features of the graph convolution blocks at the same stage:
f_fast^in = (B, βC, αT, V, M)
f_slow^in = (B, C, T, V, M)
The temporal dimension of the fast branch is always αT_1, where α is a positive integer representing the ratio of the input frame rate of the fast branch to the frame rate of the slow branch in the initial input features. In the fast branch, the channel number β_iC_i is significantly smaller than the channel number C_i of the slow-branch graph convolution block at the same stage, where i is the block index and β_i is a value less than 1, e.g. 1/3; the V of the two branches is identical, both being the number of graph nodes.
Optionally, in one embodiment of the present application, a cross-connection module is used to share the information learned by the fast and slow branches, fusing from the fast branch to the slow branch. Since the feature shapes of f_fast and f_slow are (B, βC, αT, V, M) and (B, C, T, V, M) respectively, a two-dimensional convolution layer is first used for feature shape conversion, followed by a batch regularization layer and a ReLU function; the two features are then fused by concatenation or addition.
Optionally, in an embodiment of the application, in step S40, the final features obtained in step S30 are passed through a global pooling layer to eliminate the three dimensions of time T, graph nodes V, and number of persons M; the features are then mapped to each action category through a fully connected layer, and finally the score of each action category is obtained through a Softmax function.
In order to achieve the above object, a second embodiment of the present application provides an action recognition system based on a fast-slow dual-flow graph convolutional neural network, including the following modules:
the acquisition module is used for acquiring human skeleton joint features;
the processing module is used for regularizing the human skeleton joint features, wherein the feature tensor of one batch is reshaped, a one-dimensional batch regularization module is applied along the temporal dimension, and the features are then reshaped back to their original shape;
the generating module is used for duplicating the human skeleton joint features processed by the processing module to obtain two identical copies, inputting the two copies to the fast branch and the slow branch of the fast-slow dual-flow graph convolutional network, respectively, for feature learning, and fusing the learning results of the fast branch and the slow branch to obtain the features of each action category, wherein the fast branch and the slow branch of the fast-slow dual-flow graph convolutional network have the same network structure but different network parameter configurations and input features;
and the determining module is used for eliminating the extra dimensions of the features of each action category through a global pooling layer, mapping the pooled features to the corresponding action categories through a fully connected layer, and obtaining the score of each action category through a Softmax function.
In order to achieve the above object, a third aspect of the present application provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the action recognition method based on the fast and slow dual-flow graph convolutional neural network according to the embodiment of the first aspect of the present application.
To achieve the above object, a fourth embodiment of the present application provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the action recognition method based on the fast and slow dual-flow graph convolutional neural network as described in the embodiment of the first aspect of the present application.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a flowchart of an action identification method based on a fast-slow dual-flow graph convolutional neural network according to an embodiment of the present application.
FIG. 2 is a schematic structural diagram of a fast-slow dual-flow graph convolutional neural network according to an embodiment of the present application;
FIG. 3 is a diagram illustrating the change of the feature shapes of the input features of the fast and slow branches with the increase of the number of the convolution blocks according to the embodiment of the present application;
fig. 4 is a schematic view of a cross-connection module according to an embodiment of the present application.
Fig. 5 is a schematic structural diagram of an action recognition system based on a fast-slow dual-flow graph convolutional neural network according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
The following describes an action recognition method based on a fast-slow dual-flow graph convolutional neural network according to an embodiment of the present invention with reference to the accompanying drawings.
As shown in fig. 1, to achieve the above object, an embodiment of a first aspect of the present invention provides an action recognition method based on a fast and slow dual-flow graph convolutional neural network, including the following steps:
step S10, acquiring human skeleton joint features;
step S20, regularizing the human skeleton joint features, wherein the feature tensor of one batch is reshaped, a one-dimensional batch regularization module is applied along the temporal dimension, and the features are then reshaped back to their original shape;
step S30, duplicating the human skeleton joint features processed in step S20 to obtain two identical copies, inputting the two copies to the fast branch and the slow branch of the fast-slow dual-flow graph convolutional network, respectively, for feature learning, and fusing the learning results of the fast branch and the slow branch to obtain the features of each action category, wherein the fast branch and the slow branch of the fast-slow dual-flow graph convolutional network have the same network structure but different network parameter configurations and input features;
and step S40, eliminating the extra dimensions of the features of each action category through a global pooling layer, mapping the pooled features to the corresponding action categories through a fully connected layer, and obtaining the score of each action category through a Softmax function.
In an embodiment of the present application, the step S10 further includes the following steps:
human skeleton joint features are obtained from public data sets such as NTU RGB+D, and the feature shape of each sample is:
(C,T,M,V)
where C is the number of feature channels, equal to 3 and representing the three-dimensional coordinates (x, y, z) of the joint points; T is the number of frames of the action; M is the number of persons performing the action; and V is the number of human joint points.
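For illustration only (this snippet is not part of the original filing; PyTorch and the NTU RGB+D dimension values are assumptions), a single sample tensor with this (C, T, M, V) layout could be constructed as follows:

    import torch

    # Hypothetical sample shaped (C, T, M, V): 3-D joint coordinates over
    # 300 frames, up to 2 performers, 25 joints (NTU RGB+D conventions).
    sample = torch.randn(3, 300, 2, 25)
    C, T, M, V = sample.shape  # 3, 300, 2, 25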
In an embodiment of the present application, the step S20 further includes the following steps:
the data is regularized; batch training is used in the training process, and the feature shape of one batch tensor is:
(B,C,T,M,V)
first, the batch tensor is reshaped to:
(B,M*V*C,T)
then, a one-dimensional batch regularization module is applied to the temporal dimension T, and the features are reshaped back to the original shape (B, C, T, M, V).
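A minimal PyTorch sketch of this step, assuming BatchNorm1d as the one-dimensional batch regularization module (class and variable names are illustrative, not from the patent):

    import torch
    import torch.nn as nn

    class DataBatchNorm(nn.Module):
        # Step S20 scheme: reshape (B, C, T, M, V) -> (B, M*V*C, T), apply a
        # one-dimensional batch regularization over T, then reshape back.
        def __init__(self, channels, joints, persons):
            super().__init__()
            self.bn = nn.BatchNorm1d(persons * joints * channels)

        def forward(self, x):
            B, C, T, M, V = x.shape
            x = x.permute(0, 3, 4, 1, 2).reshape(B, M * V * C, T)
            x = self.bn(x)  # regularize along the temporal dimension T
            return x.reshape(B, M, V, C, T).permute(0, 3, 4, 1, 2)

    # Example: a batch of 8 samples with C=3, T=300, M=2, V=25.
    y = DataBatchNorm(3, 25, 2)(torch.randn(8, 3, 300, 2, 25))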
As shown in fig. 2, our network structure contains two branches, which we call fast and slow branches, respectively.
In an embodiment of the present application, further, the specific steps in step S30 include:
each branch comprises a plurality of consecutively stacked graph convolution blocks, and each graph convolution block comprises a spatial graph convolution layer and a temporal convolution layer; the temporal convolution layer is a two-dimensional convolution module with kernel size (t, 1), where t is the temporal receptive field of the convolution kernel; each of the two convolution layers is followed by a batch regularization layer and a ReLU activation function, ensuring that the features of each channel keep the same distribution; the computation of the graph convolution block is described by the following formula:
f_out = Σ_{k=1..K_v} W_k f_in (A_k + B_k + C_k)
where A_k is the predefined adjacency matrix; B_k and C_k are the adaptive adjacency matrices proposed in 2s-AGCN, which change during network training: B_k is initialized to A_k and learns the potential association of any two nodes, and C_k is a matrix computed from the sample features that describes the sample-specific node associations.
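One possible PyTorch reading of such a graph convolution block is sketched below; the 2s-AGCN-style adaptive adjacency (fixed A_k, learned B_k, data-dependent C_k) follows the formula above, while the subset count K, the embedding width, and the time-pooled computation of C_k are simplifying assumptions rather than the patent's exact design:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AdaptiveGraphConv(nn.Module):
        # Spatial graph convolution over (A_k + B_k + C_k), in the spirit of
        # 2s-AGCN: A_k fixed, B_k learned (initialized to A_k), C_k computed
        # from the sample features.
        def __init__(self, in_ch, out_ch, A, embed=16):
            super().__init__()
            K = A.shape[0]  # number of adjacency subsets (assumed)
            self.register_buffer('A', A.clone())
            self.B = nn.Parameter(A.clone())
            self.theta = nn.ModuleList(nn.Conv2d(in_ch, embed, 1) for _ in range(K))
            self.phi = nn.ModuleList(nn.Conv2d(in_ch, embed, 1) for _ in range(K))
            self.W = nn.ModuleList(nn.Conv2d(in_ch, out_ch, 1) for _ in range(K))

        def forward(self, x):  # x: (N, C, T, V)
            out = 0
            for k in range(len(self.W)):
                # C_k: sample-specific adjacency from embedded similarities,
                # pooled over time here for brevity
                q = self.theta[k](x).mean(dim=2)  # (N, embed, V)
                p = self.phi[k](x).mean(dim=2)
                Ck = F.softmax(torch.einsum('nev,new->nvw', q, p), dim=-1)
                agg = torch.einsum('nctv,vw->nctw', x, self.A[k] + self.B[k]) \
                    + torch.einsum('nctv,nvw->nctw', x, Ck)
                out = out + self.W[k](agg)
            return out

    class GCNBlock(nn.Module):
        # One graph convolution block: spatial graph convolution followed by a
        # (t, 1) temporal convolution, each with batch regularization and ReLU.
        def __init__(self, in_ch, out_ch, A, t=9, stride=1):
            super().__init__()
            self.gcn = AdaptiveGraphConv(in_ch, out_ch, A)
            self.bn1 = nn.BatchNorm2d(out_ch)
            self.tcn = nn.Conv2d(out_ch, out_ch, kernel_size=(t, 1),
                                 stride=(stride, 1), padding=((t - 1) // 2, 0))
            self.bn2 = nn.BatchNorm2d(out_ch)

        def forward(self, x):
            x = F.relu(self.bn1(self.gcn(x)))
            return F.relu(self.bn2(self.tcn(x)))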
Optionally, in an embodiment of the present application, the following two formulas respectively describe the feature shapes of the input features of the graph convolution blocks at the same stage:
f_fast^in = (B, βC, αT, V, M)
f_slow^in = (B, C, T, V, M)
The temporal dimension of the fast branch is always αT_1, where α is a positive integer representing the ratio of the input frame rate of the fast branch to the frame rate of the slow branch in the initial input features. In the fast branch, the channel number β_iC_i is significantly smaller than the channel number C_i of the slow-branch graph convolution block at the same stage, where i is the block index and β_i is a value less than 1, e.g. 1/3; the V of the two branches is identical, both being the number of graph nodes.
In one embodiment of the present application, further, assuming that there are N graph convolution blocks in the network structure: in the slow branch, the frame rate is reduced by the stride of the temporal convolution layer within each graph convolution block, so that T_1 ≥ T_2 ≥ … ≥ T_N; on the other hand, the number of output channels gradually increases from block to block to improve the slow branch's ability to capture graph spatial structure information, so that C_1 ≤ C_2 ≤ … ≤ C_N. In the fast branch, the stride of the convolution kernel in the temporal convolution layers of all graph convolution blocks is set to 1 to ensure that the frame rate is not reduced; therefore, the temporal dimension of the fast branch is always αT_1, where α is a positive integer representing the ratio of the input frame rate of the fast branch to the frame rate of the slow branch in the initial input features. In the fast branch, the channel number β_iC_i is significantly smaller than the channel number C_i of the slow-branch graph convolution block at the same stage, where i is the block index and β_i is a value less than 1, such as 1/3. The V of the two branches is identical, both being the number of graph nodes.
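Under the same assumptions, the two branches could be assembled as below, reusing GCNBlock from the previous sketch; the channel progressions echo the embodiment described later (slow: 3, 128, 256, 512; fast: 3, 32, 64, 128), while the strides, block count, and identity adjacency are placeholders:

    import torch
    import torch.nn as nn

    # GCNBlock is taken from the previous sketch; A is a placeholder stack of
    # K=3 adjacency subsets over V=25 joints (identity, for illustration only).
    A = torch.eye(25).unsqueeze(0).repeat(3, 1, 1)

    def make_branch(channels, strides):
        return nn.Sequential(*[GCNBlock(channels[i], channels[i + 1], A,
                                        stride=strides[i])
                               for i in range(len(strides))])

    # Slow branch: strided temporal convolutions reduce the frame rate
    # (T_1 >= T_2 >= ... >= T_N) while channels grow (C_1 <= C_2 <= ... <= C_N).
    slow = make_branch([3, 128, 256, 512], strides=[1, 2, 2])
    # Fast branch: stride 1 keeps the full frame rate αT_1; channel numbers
    # are a β (< 1) fraction of the slow branch's at each stage.
    fast = make_branch([3, 32, 64, 128], strides=[1, 1, 1])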
In one embodiment of the present application, further, as shown in FIG. 4, a cross-connection module is used to share the information learned by the fast and slow branches, fusing from the fast branch to the slow branch. Since the feature shapes of f_fast and f_slow are (B, βC, αT, V, M) and (B, C, T, V, M) respectively, a two-dimensional convolution layer is first used for feature shape transformation, then a batch regularization layer and a ReLU function are added, and the two features are then fused by splicing or addition. The above process can be described by the following formulas:
f̂_fast = Conv2D(f_fast)
f̂_fast = ReLU(BN(f̂_fast))
f_slow' = Fuse(f_slow, f̂_fast)
where Conv2D is a two-dimensional convolution layer, BN is a batch regularization layer, ReLU is the activation function, and Fuse is the fusion function; the fusion can be performed by summation (Sum) or splicing (Concatenation), and the two modes perform comparably.
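A minimal sketch of such a cross-connection module, assuming sum fusion and assuming the person dimension M has been folded into the batch axis (a common implementation convention, not stated in the patent); a temporally strided Conv2D maps the fast feature (B, βC, αT, V) onto the slow feature's (B, C, T, V):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossConnection(nn.Module):
        # Fuses fast-branch features into the slow branch: Conv2D for feature
        # shape conversion, then BN and ReLU, then fusion by summation.
        def __init__(self, fast_ch, slow_ch, alpha):
            super().__init__()
            # kernel/stride alpha on the temporal axis aligns alpha*T with T
            self.conv = nn.Conv2d(fast_ch, slow_ch, kernel_size=(alpha, 1),
                                  stride=(alpha, 1))
            self.bn = nn.BatchNorm2d(slow_ch)

        def forward(self, f_fast, f_slow):
            x = F.relu(self.bn(self.conv(f_fast)))  # feature shape conversion
            return f_slow + x  # Fuse by summation (Sum)

    # Example with alpha = 2: fast (B, 32, 64, 25) fused onto slow (B, 128, 32, 25).
    fuse = CrossConnection(fast_ch=32, slow_ch=128, alpha=2)
    out = fuse(torch.randn(8, 32, 64, 25), torch.randn(8, 128, 32, 25))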
Further, the present embodiment inserts cross-connection modules between the two branches to share information between them. In the experiments of this embodiment, 10 graph convolution blocks are used, where the numbers of input channels of the slow branch and the fast branch are 3, 128, 256, 512 and 3, 32, 64, 128, respectively.
In an embodiment of the application, in step S40, the final features obtained in step S30 are passed through a global pooling layer to eliminate the three dimensions of time T, graph nodes V, and number of persons M; the features are then mapped to each action category through a fully connected layer, and finally the score of each action category is obtained through a Softmax function.
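A sketch of this classification head under the same folded-M convention; the channel width and class count in the example are illustrative assumptions (e.g., concatenated slow and fast features, 60 NTU RGB+D classes):

    import torch
    import torch.nn as nn

    class ClassificationHead(nn.Module):
        # Global pooling removes T and V, the M persons are averaged, a fully
        # connected layer maps to the action categories, Softmax gives scores.
        def __init__(self, in_ch, num_classes):
            super().__init__()
            self.fc = nn.Linear(in_ch, num_classes)

        def forward(self, x, persons):  # x: (B*M, C, T, V)
            x = x.mean(dim=(2, 3))  # global pooling over T and V
            x = x.view(-1, persons, x.size(1)).mean(dim=1)  # average over M
            return torch.softmax(self.fc(x), dim=1)  # score per action class

    # Example: 640 fused channels (e.g., 512 slow + 128 fast), 60 classes.
    head = ClassificationHead(in_ch=640, num_classes=60)
    scores = head(torch.randn(16 * 2, 640, 16, 25), persons=2)  # (16, 60)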
To achieve the above object, as shown in fig. 5, an embodiment of a second aspect of the present application provides an action recognition system based on the fast-slow dual-flow graph convolutional neural network, which includes the following modules:
the acquisition module is used for acquiring human skeleton joint features;
the processing module is used for regularizing the human skeleton joint features, wherein the feature tensor of one batch is reshaped, a one-dimensional batch regularization module is applied along the temporal dimension, and the features are then reshaped back to their original shape;
the generating module is used for duplicating the human skeleton joint features processed by the processing module to obtain two identical copies, inputting the two copies to the fast branch and the slow branch of the fast-slow dual-flow graph convolutional network, respectively, for feature learning, and fusing the learning results of the fast branch and the slow branch to obtain the features of each action category, wherein the fast branch and the slow branch of the fast-slow dual-flow graph convolutional network have the same network structure but different network parameter configurations and input features;
and the determining module is used for eliminating the extra dimensions of the features of each action category through a global pooling layer, mapping the pooled features to the corresponding action categories through a fully connected layer, and obtaining the score of each action category through a Softmax function.
In order to implement the foregoing embodiments, the present invention further provides a computer device, where the computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the method implements the method for identifying an action based on a fast-slow dual-flow graph convolutional neural network according to the embodiments of the present application.
In order to implement the foregoing embodiments, the present invention further provides a non-transitory computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the method for identifying an action based on a fast-slow dual-flow graph convolutional neural network according to an embodiment of the present application is implemented.
Although the present application has been disclosed in detail with reference to the accompanying drawings, it is to be understood that such description is merely illustrative and not restrictive of the application of the present application. The scope of the present application is defined by the appended claims and may include various modifications, adaptations, and equivalents of the invention without departing from the scope and spirit of the application.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. An action recognition method based on a fast and slow dual-flow graph convolutional neural network, characterized by comprising the following steps:
step S10, acquiring human skeleton joint features;
step S20, regularizing the human skeleton joint features, wherein the feature tensor of one batch is reshaped, a one-dimensional batch regularization module is applied along the temporal dimension, and the features are then reshaped back to their original shape;
step S30, duplicating the human skeleton joint features processed in step S20 to obtain two identical copies, inputting the two copies to the fast branch and the slow branch of the fast-slow dual-flow graph convolutional network, respectively, for feature learning, and fusing the learning results of the fast branch and the slow branch to obtain the features of each action category, wherein the fast branch and the slow branch of the fast-slow dual-flow graph convolutional network have the same network structure but different network parameter configurations and input features;
and step S40, eliminating the extra dimensions of the features of each action category through a global pooling layer, mapping the pooled features to the corresponding action categories through a fully connected layer, and obtaining the score of each action category through a Softmax function.
2. The method of claim 1, wherein the step S10 includes the following steps:
human skeleton joint features are obtained from the data set, and the feature shape of each sample is:
(C,T,M,V)
where C is the number of feature channels, equal to 3 and representing the three-dimensional coordinates (x, y, z) of the joint points; T is the number of frames of the action; M is the number of persons performing the action; and V is the number of human joint points.
3. The method of claim 1, wherein the step S20 includes the following steps:
the data is regularized; batch training is used in the training process, and the feature shape of one batch tensor is:
(B,C,T,M,V)
first, the batch tensor is reshaped to:
(B,M*V*C,T)
then, a one-dimensional batch regularization module is applied to the temporal dimension T, and the features are reshaped back to the original shape (B, C, T, M, V).
4. The method as claimed in claim 1, wherein the step S30 includes the following steps:
each branch comprises a plurality of consecutively stacked graph convolution blocks, and each graph convolution block comprises a spatial graph convolution layer and a temporal convolution layer; the temporal convolution layer is a two-dimensional convolution module with kernel size (t, 1), where t is the temporal receptive field of the convolution kernel; each of the two convolution layers is followed by a batch regularization layer and a ReLU activation function, ensuring that the features of each channel keep the same distribution; the computation of the graph convolution block is described by the following formula:
f_out = Σ_{k=1..K_v} W_k f_in (A_k + B_k + C_k)
where A_k is the predefined adjacency matrix; B_k and C_k are the adaptive adjacency matrices proposed in 2s-AGCN, which change during network training: B_k is initialized to A_k and learns the potential association of any two nodes, and C_k is a matrix computed from the sample features that describes the sample-specific node associations.
5. The method of claim 4, wherein the following two formulas respectively describe the feature shapes of the input features of the graph convolution blocks at the same stage:
f_fast^in = (B, βC, αT, V, M)
f_slow^in = (B, C, T, V, M)
The temporal dimension of the fast branch is always αT_1, where α is a positive integer representing the ratio of the input frame rate of the fast branch to the frame rate of the slow branch in the initial input features. In the fast branch, the channel number β_iC_i is significantly smaller than the channel number C_i of the slow-branch graph convolution block at the same stage, where i is the block index and β_i is a value less than 1, e.g. 1/3; the V of the two branches is identical, both being the number of graph nodes.
6. The method of claim 4, wherein a cross-connection module is used to share the information learned by the fast and slow branches, fusing from the fast branch to the slow branch; since the feature shapes of f_fast and f_slow are (B, βC, αT, V, M) and (B, C, T, V, M) respectively, a two-dimensional convolution layer is first used for feature shape conversion, followed by a batch regularization layer and a ReLU function, and the two features are then fused by splicing or addition.
7. The method as claimed in claim 1, wherein in step S40, the final features obtained in step S30 are passed through a global pooling layer to eliminate the three dimensions of time T, graph nodes V and number of persons M; the features are mapped to each action category through a fully connected layer, and finally the score of each action category is obtained through a Softmax function.
8. An action recognition system based on a fast and slow dual-flow graph convolutional neural network, characterized by comprising:
the acquisition module, used for acquiring human skeleton joint features;
the processing module, used for regularizing the human skeleton joint features, wherein the feature tensor of one batch is reshaped, a one-dimensional batch regularization module is applied along the temporal dimension, and the features are then reshaped back to their original shape;
the generating module, used for duplicating the human skeleton joint features processed by the processing module to obtain two identical copies, inputting the two copies to the fast branch and the slow branch of the fast-slow dual-flow graph convolutional network, respectively, for feature learning, and fusing the learning results of the fast branch and the slow branch to obtain the features of each action category, wherein the fast branch and the slow branch of the fast-slow dual-flow graph convolutional network have the same network structure but different network parameter configurations and input features;
and the determining module, used for eliminating the extra dimensions of the features of each action category through a global pooling layer, mapping the pooled features to the corresponding action categories through a fully connected layer, and obtaining the score of each action category through a Softmax function.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1-7 when executing the computer program.
10. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-7.
CN202110510781.9A 2021-05-11 2021-05-11 Action identification method and system based on fast and slow dual-flow graph convolutional neural network Active CN113158970B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110510781.9A CN113158970B (en) 2021-05-11 2021-05-11 Action identification method and system based on fast and slow dual-flow graph convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110510781.9A CN113158970B (en) 2021-05-11 2021-05-11 Action identification method and system based on fast and slow dual-flow graph convolutional neural network

Publications (2)

Publication Number Publication Date
CN113158970A (en) 2021-07-23
CN113158970B CN113158970B (en) 2023-02-07

Family

Family ID: 76874442

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110510781.9A Active CN113158970B (en) 2021-05-11 2021-05-11 Action identification method and system based on fast and slow dual-flow graph convolutional neural network

Country Status (1)

Country Link
CN (1) CN113158970B (en)

Cited By (2)

Publication number Priority date Publication date Assignee Title
CN114201475A (en) * 2022-02-16 2022-03-18 北京市农林科学院信息技术研究中心 Dangerous behavior supervision method and device, electronic equipment and storage medium
CN114550027A (en) * 2022-01-18 2022-05-27 清华大学 Vision-based motion video fine analysis method and device


Patent Citations (8)

Publication number Priority date Publication date Assignee Title
WO2017133009A1 (en) * 2016-02-04 2017-08-10 广州新节奏智能科技有限公司 Method for positioning human joint using depth image of convolutional neural network
US20200285944A1 (en) * 2019-03-08 2020-09-10 Adobe Inc. Graph convolutional networks with motif-based attention
CN110059598A (en) * 2019-04-08 2019-07-26 南京邮电大学 The Activity recognition method of the long time-histories speed network integration based on posture artis
CN112131908A (en) * 2019-06-24 2020-12-25 北京眼神智能科技有限公司 Action identification method and device based on double-flow network, storage medium and equipment
CN111325099A (en) * 2020-01-21 2020-06-23 南京邮电大学 Sign language identification method and system based on double-current space-time diagram convolutional neural network
CN111860128A (en) * 2020-06-05 2020-10-30 南京邮电大学 Human skeleton behavior identification method based on multi-stream fast-slow graph convolution network
CN112183313A (en) * 2020-09-27 2021-01-05 武汉大学 SlowFast-based power operation field action identification method
CN112381004A (en) * 2020-11-17 2021-02-19 华南理工大学 Framework-based double-flow self-adaptive graph convolution network behavior identification method

Non-Patent Citations (5)

Title
CHENG-HUNG LIN et al.: "SlowFast-GCN: A Novel Skeleton-Based Action Recognition Framework", 2020 International Conference on Pervasive Artificial Intelligence (ICPAI)
LEI SHI et al.: "Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition", 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
NING SUN et al.: "Multi-stream slowFast graph convolutional networks for skeleton-based action recognition", Image and Vision Computing
ZHANG Yijia et al.: "Improved human action recognition algorithm based on two-stream convolutional neural network", Computer Measurement & Control (计算机测量与控制)
CHEN Li et al.: "Research on feature-balanced YOLOv3 pedestrian detection based on multi-view data fusion", CAAI Transactions on Intelligent Systems (智能系统学报)


Also Published As

Publication number Publication date
CN113158970B (en) 2023-02-07

Similar Documents

Publication Publication Date Title
CN112308200B (en) Searching method and device for neural network
CN109558862B (en) Crowd counting method and system based on attention thinning framework of space perception
CN111476719B (en) Image processing method, device, computer equipment and storage medium
Zhang et al. Progressive hard-mining network for monocular depth estimation
CN113449857A (en) Data processing method and data processing equipment
CN111738231A (en) Target object detection method and device, computer equipment and storage medium
CN111667459B (en) Medical sign detection method, system, terminal and storage medium based on 3D variable convolution and time sequence feature fusion
CN111480169A (en) Method, system and apparatus for pattern recognition
CN111476806B (en) Image processing method, image processing device, computer equipment and storage medium
CN113158970B (en) Action identification method and system based on fast and slow dual-flow graph convolutional neural network
CN111754396A (en) Face image processing method and device, computer equipment and storage medium
CN111833360B (en) Image processing method, device, equipment and computer readable storage medium
JP7357176B1 (en) Night object detection, training method and device based on self-attention mechanism in frequency domain
CN111160225B (en) Human body analysis method and device based on deep learning
US20230326173A1 (en) Image processing method and apparatus, and computer-readable storage medium
CN110222718A (en) The method and device of image procossing
CN111079507A (en) Behavior recognition method and device, computer device and readable storage medium
Zhang et al. Progressive point cloud upsampling via differentiable rendering
CN110688897A (en) Pedestrian re-identification method and device based on joint judgment and generation learning
He et al. Learning scene dynamics from point cloud sequences
CN114359289A (en) Image processing method and related device
Angelopoulou et al. Fast 2d/3d object representation with growing neural gas
CN113554656B (en) Optical remote sensing image example segmentation method and device based on graph neural network
CN113065529A (en) Motion recognition method and system based on inter-joint association modeling
CN111274901B (en) Gesture depth image continuous detection method based on depth gating recursion unit

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant