CN114898467A - Human motion action recognition method, system and storage medium - Google Patents

Human motion action recognition method, system and storage medium Download PDF

Info

Publication number
CN114898467A
Authority
CN
China
Prior art keywords
local
convolution
space
time
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210580454.5A
Other languages
Chinese (zh)
Inventor
黄晋
邢玉玲
肖菁
朱佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China Normal University
Original Assignee
South China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China Normal University filed Critical South China Normal University
Priority to CN202210580454.5A priority Critical patent/CN114898467A/en
Publication of CN114898467A publication Critical patent/CN114898467A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/52 Scale-space analysis, e.g. wavelet analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a human motion action recognition method, system and storage medium, applied to the technical field of human motion action recognition, which can effectively improve the performance and accuracy of human motion recognition. The method comprises the following steps: constructing a local and non-local space-time graph convolution unit, wherein the unit comprises a local path and a non-local path, the input ends of the local path and the non-local path are connected with the input end of the unit, and the outputs of the local path and the non-local path are aggregated as the output of the unit; inputting the skeleton diagram sequence into the local and non-local space-time graph convolution unit to extract a first space-time feature; inputting the first space-time feature into a global average pooling layer to obtain a second space-time feature; and inputting the second space-time feature into a fully connected layer and a classifier in sequence to predict a recognition result.

Description

Human motion action recognition method, system and storage medium
Technical Field
The invention relates to the technical field of human motion action recognition, in particular to a human motion action recognition method, a human motion action recognition system and a storage medium.
Background
In recent years, action recognition based on the human skeleton has received more and more attention due to its wide application scenarios, such as human-computer interaction, video surveillance and video retrieval. The human skeleton structure takes the form of a graph, where each joint of the human skeleton is primarily identified by a joint type, a time frame index and a three-dimensional coordinate position. Traditional skeleton-based action recognition methods treat human joints as a group of independent features and model the spatial and temporal dependence of the joints by designing manual features. However, manual features are generally shallow and strongly dependent on the dataset, resulting in a lack of flexibility and generalization in the model. With the development of artificial intelligence and deep learning technology, a great deal of GCN-based human action recognition research has gradually emerged. However, in the related art, it remains difficult to capture richer, deeper and more effective spatio-temporal features between joints in actual processing and modeling.
Disclosure of Invention
In order to solve at least one of the above technical problems, the present invention provides a method, a system and a storage medium for recognizing human body movement, which can effectively improve the recognition performance and accuracy of human body movement.
In one aspect, an embodiment of the invention provides a human motion action recognition method, which comprises the following steps:
constructing a local and non-local space-time graph convolution unit; the local and non-local space-time convolution units comprise local paths and non-local paths, the input ends of the local paths and the non-local paths are connected with the input ends of the local and non-local space-time graph convolution units, and the output of the local paths and the output of the non-local paths are aggregated to be used as the output of the local and non-local space-time graph convolution units;
inputting the skeleton diagram sequence into the local and non-local space-time diagram convolution unit to extract a first space-time characteristic;
inputting the first space-time characteristics into a global average pooling layer to obtain second space-time characteristics;
and inputting the second space-time characteristics into a fully connected layer and a classifier in sequence to predict a recognition result.
The human motion action recognition method provided by the embodiment of the invention has at least the following beneficial effects: the embodiment first constructs a local and non-local space-time graph convolution unit, which comprises a local path and a non-local path. The skeleton diagram sequence is then input into the local and non-local space-time graph convolution unit: local spatio-temporal information between joints in the skeleton diagram sequence is extracted through the local path, while the non-local spatial dependencies and long-term spatio-temporal dependencies in the skeleton diagram sequence are extracted through the non-local path. The outputs of the local path and the non-local path are aggregated to obtain a first space-time feature, and the recognition result of the human action is predicted after the feature passes through the global average pooling layer, the fully connected layer and the classifier in sequence. By extracting local spatio-temporal information and non-local spatio-temporal information, namely the non-local spatial dependencies and long-term spatio-temporal dependencies of the skeleton diagram sequence, the embodiment constructs an effective spatio-temporal feature extraction unit, the local and non-local spatio-temporal graph convolution unit, thereby realizing accurate recognition of human motion actions and effectively improving the performance and accuracy of human motion recognition.
According to some embodiments of the invention, the constructing the local and non-local space-time graph convolution unit comprises:
constructing a local time-space domain;
constructing a cross-space-time jump connecting edge according to the local space-time domain;
performing space-time convolution construction on the cross-space-time jump connection edge to obtain a local convolution module;
and constructing the local path according to the local convolution module.
According to some embodiments of the invention, the constructing the local and non-local space-time graph convolution unit further comprises:
constructing a non-local space-time graph convolution module;
constructing a multi-scale time graph convolution module;
constructing the non-local path according to the non-local space-time graph convolution module and the multi-scale time graph convolution module; the input end of the non-local space-time graph convolution module is connected with the input end of the local and non-local space-time graph convolution unit, the output end of the non-local space-time graph convolution module is connected with the input end of the multi-scale time graph convolution module, and the output of the multi-scale time graph convolution module is used as the output of the non-local path.
According to some embodiments of the invention, the constructing a non-local space-time graph convolution module comprises:
constructing non-local space map features according to the skeleton map sequence;
constructing a mask matrix;
processing the mask matrix through a first activation function to obtain first mask data;
and constructing and obtaining the non-local space-time graph convolution module according to the first mask data and the non-local space graph characteristics.
According to some embodiments of the present invention, the multi-scale time graph convolution module includes an expansion convolution sub-module, a convolution sub-module, and a pooling convolution sub-module, wherein the input ends of the expansion convolution sub-module, the convolution sub-module, and the pooling convolution sub-module are connected to the output end of the non-local space-time graph convolution module, and the output ends of the expansion convolution sub-module, the convolution sub-module, and the pooling convolution sub-module are connected.
According to some embodiments of the present invention, the expansion convolution submodule includes a convolution operation layer, a first shift operation layer, a point-by-point convolution layer, and a second shift operation layer, and the convolution operation layer, the first shift operation layer, the point-by-point convolution layer, and the second shift operation layer are connected in sequence; the expansion convolution sub-modules are 4 in number and comprise a first expansion convolution sub-module, a second expansion convolution sub-module, a third expansion convolution sub-module and a fourth expansion convolution sub-module, the point-by-point convolution layer expansion rate of the first expansion convolution sub-module is 1, the point-by-point convolution layer expansion rate of the second expansion convolution sub-module is 2, the point-by-point convolution layer expansion rate of the third expansion convolution sub-module is 3, and the point-by-point convolution layer expansion rate of the fourth expansion convolution sub-module is 4.
According to some embodiments of the present invention, the local and non-local space-time graph convolution unit includes a first local and non-local space-time graph convolution unit, a second local and non-local space-time graph convolution unit, and a third local and non-local space-time graph convolution unit, the first local and non-local space-time graph convolution unit has a characteristic channel of 96, the first local and non-local space-time graph convolution unit has a time convolution and space-time window step size of 1, the first local and non-local space-time convolution unit has an output end connected to a first processing normalization module and a second activation function in sequence, the second activation function has an output end connected to an input end of the second local and non-local space-time graph convolution unit, the second local and non-local space-time graph convolution unit has a characteristic channel of 192, the second local and non-local space-time graph convolution unit has a time convolution and space-time window step size of 2, the output end of the second local and non-local space-time convolution unit is sequentially connected with a second batch processing normalization module and a third activation function, the output end of the third activation function is connected with the input end of the third local and non-local space-time graph convolution unit, the characteristic channel of the third local and non-local space-time graph convolution unit is 384, the time convolution and space-time window step length of the third local and non-local space-time graph convolution unit is 2, and the output end of the third local and non-local space-time graph convolution unit is connected with the input end of the global average pooling layer.
In another aspect, an embodiment of the invention further provides a human motion action recognition system, which comprises:
the space-time graph convolution unit construction module is used for constructing a local and non-local space-time graph convolution unit; the local and non-local space-time convolution units comprise local paths and non-local paths, the input ends of the local paths and the non-local paths are connected with the input ends of the local and non-local space-time graph convolution units, and the output of the local paths and the output of the non-local paths are aggregated to be used as the output of the local and non-local space-time graph convolution units;
the characteristic extraction module is used for inputting the skeleton diagram sequence into the local and non-local space-time diagram convolution unit to extract a first space-time characteristic;
the pooling module is used for inputting the first space-time characteristic into a global average pooling layer to obtain a second space-time characteristic;
and the prediction output module is used for inputting the second space-time characteristics into the fully connected layer and the classifier in sequence to predict a recognition result.
In another aspect, an embodiment of the invention further provides a human motion action recognition system, which comprises:
at least one processor;
at least one memory for storing at least one program;
when the at least one program is executed by the at least one processor, the at least one processor is enabled to implement the human motion action recognition method according to the above embodiment.
In another aspect, an embodiment of the present invention further provides a computer storage medium, in which a program executable by a processor is stored, and when the program executable by the processor is executed by the processor, the program is used to implement the human motion action recognition method according to the above embodiment.
Drawings
Fig. 1 is a flowchart of a human motion action recognition method according to an embodiment of the present invention;
FIG. 2 is a schematic block diagram of a human motion recognition system according to an embodiment of the present invention;
FIG. 3 is a schematic block diagram of a local and non-local space-time graph convolution unit according to an embodiment of the present invention;
FIG. 4 is a schematic block diagram of a multi-scale time graph convolution module according to an embodiment of the present invention;
FIG. 5 is a schematic block diagram of a human motion recognition model according to an embodiment of the present invention;
fig. 6 is a schematic diagram of shift convolution according to an embodiment of the present invention.
Detailed Description
The embodiments described herein should not be construed as limiting the present application; all other embodiments that can be obtained by a person skilled in the art without creative effort shall fall within the scope of protection of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
In recent years, action recognition based on the human skeleton has received more and more attention due to its wide application scenarios, such as human-computer interaction, video surveillance and video retrieval. The human skeleton structure takes the form of a graph, and each joint of the human skeleton is mainly identified by a joint type, a time frame index and a three-dimensional coordinate position. Traditional skeleton-based action recognition methods regard human joints as a group of independent features and model the spatial and temporal dependence of the joints mainly by designing manual features. Manual features are generally shallow and strongly dependent on the dataset, resulting in a lack of flexibility and generalization in the model. In the related art, with the rapid development of deep learning technology, the skeleton-based human action recognition task has gradually turned from traditional methods to deep learning methods, which construct the dynamic spatio-temporal features of skeleton sequences in a data-driven manner, and a great deal of human action recognition research based on the Graph Convolution Network (GCN) has appeared. Although encoding human skeleton data into a skeleton graph structure and applying a GCN to recognize human actions achieves a larger breakthrough in both skeleton data encoding and spatio-temporal feature modeling compared with the previous methods based on Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) models, it is still difficult to capture richer, deeper and more effective spatio-temporal features between joints in actual processing and modeling. For example, it is difficult to capture the dependencies between indirectly connected joints, to capture the complex and diverse temporal motion information present in large numbers of action samples, and to capture the local, complex spatio-temporal fusion dependencies between joints; the differing importance of each skeletal joint for different action samples is also not considered.
Based on this, an embodiment of the present invention provides a human motion recognition method, which can effectively improve the recognition performance and accuracy of human motion. Referring to fig. 1, the method of the embodiment of the present invention includes, but is not limited to, step S110, step S120, step S130, and step S140.
Specifically, the method application process of the embodiment of the invention includes, but is not limited to, the following steps:
s110: and constructing a local and non-local space-time graph convolution unit. The local and non-local space-time convolution units comprise local paths and non-local paths, the input ends of the local paths and the non-local paths are connected with the input ends of the local and non-local space-time graph convolution units, and the output of the local paths and the output of the non-local paths are aggregated to be used as the output of the local and non-local space-time graph convolution units.
S120: and inputting the skeleton diagram sequence into a local and non-local space-time diagram convolution unit to extract a first space-time characteristic.
S130: and inputting the first space-time characteristic into the global average pooling layer to obtain a second space-time characteristic.
S140: and inputting the second space-time characteristics into the full-link layer and the classifier in sequence to predict to obtain an identification result.
In this embodiment, referring to fig. 5, a local and non-local spatio-temporal graph convolution (LNL-STGC) unit is first constructed, and the skeleton diagram sequence of the human motion is then input into the LNL-STGC unit for spatio-temporal feature extraction, yielding the first space-time feature. The first space-time feature is input into a global average pooling layer, which processes it to obtain the second space-time feature. The second space-time feature is then input into the fully connected layer and the classifier in sequence, giving the recognition result for the skeleton diagram sequence of the human motion. Specifically, referring to fig. 3, the LNL-STGC unit in this embodiment includes a local path and a non-local path. The input ends of both paths are connected to the input end of the LNL-STGC unit; that is, after the skeleton diagram sequence is input into the LNL-STGC unit, it is fed into the local path and the non-local path separately for feature extraction. The input and output feature maps of the LNL-STGC unit are defined by the number of channels C, the number of joints N and the time frame length T. The output of the local path is aggregated with the output of the non-local path to form the output of the LNL-STGC unit: the local path extracts the local spatio-temporal information between joints, while the non-local path extracts the non-local spatio-temporal information between joints, so that both the local and non-local spatio-temporal information of the skeleton diagram sequence is obtained. The feature information extracted by the two paths is fused to obtain the first space-time feature. This feature is then input into the global average pooling layer, which down-samples it to remove redundant information and compress the features, thereby enlarging the receptive field. The resulting second space-time feature is input into the fully connected layer, which integrates it, and then into the classifier, which performs prediction and classification to obtain the recognition result. It should be noted that the classifier may be a Softmax classifier, which makes the classification effect more pronounced and thus improves recognition accuracy. By extracting the local and non-local spatio-temporal information of the skeleton diagram sequence, this embodiment constructs an effective spatio-temporal feature extraction unit, namely the local and non-local spatio-temporal graph convolution unit, realizing accurate recognition of human motion actions and effectively improving the performance and accuracy of human motion recognition.
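For illustration, the data flow of one LNL-STGC unit can be sketched in a few lines of PyTorch; the class name, the constructor and the element-wise summation used for aggregation are assumptions, since the embodiment only states that the two path outputs are aggregated:

```python
import torch.nn as nn

class LNLSTGCUnit(nn.Module):
    """Sketch of a local and non-local space-time graph convolution unit:
    the input feature map of shape (B, C, T, N) feeds both paths, and their
    outputs are aggregated to form the unit output."""
    def __init__(self, local_path: nn.Module, nonlocal_path: nn.Module):
        super().__init__()
        self.local_path = local_path        # L-STGC module (local path)
        self.nonlocal_path = nonlocal_path  # NL-STGC module followed by MSTGC

    def forward(self, x):
        # Both paths see the same input; summation is one possible aggregation.
        return self.local_path(x) + self.nonlocal_path(x)
```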
In some embodiments of the present invention, local and non-local space-time graph convolution units are constructed, including but not limited to:
and constructing a local time-space domain.
And constructing a cross-space-time jump connecting edge according to the local space-time domain.
And performing space-time convolution construction on the cross-space-time jump connection edge to obtain a local convolution module.
And constructing a local path according to the local convolution module.
In this embodiment, a local time-space domain is first constructed; cross-space-time jump connecting edges are then constructed according to the local time-space domain, and space-time convolution is performed on them to obtain the local convolution module, from which the local path is built. Specifically, the human skeleton graph can be represented as $\mathcal{G} = (V, E)$, where the vertex set V represents the human joints and the edge set E represents the bones between joints, initialized with an adjacency matrix $A \in \mathbb{R}^{N \times N}$ in which $A_{i,j} = 1$ denotes that node $V_i$ is directly connected to $V_j$, and $A_{i,j} = 0$ otherwise. Because the skeleton graph is an undirected graph, A is a symmetric matrix. An action is represented by a graph sequence with a feature set on the nodes, where $X_{t,n}$ is the feature of node $V_n$ at time t; the input action is therefore fully represented by the adjacency matrix A and the node features. On the input skeleton defined by the adjacency matrix A and the features X, the layer-by-layer update rule of the graph convolution network at time t is given by the following formula (1):
$$X_t^{(l+1)} = \sigma\!\left(\tilde{D}^{-\frac{1}{2}}\, \tilde{A}\, \tilde{D}^{-\frac{1}{2}}\, X_t^{(l)}\, W^{(l)}\right) \qquad (1)$$

where $\tilde{A} = A + I$ is the skeleton graph with added self-loops, $\tilde{D}$ is the degree matrix of $\tilde{A}$, $\sigma(\cdot)$ is the activation function, and $\tilde{D}^{-\frac{1}{2}} \tilde{A}\, \tilde{D}^{-\frac{1}{2}} X_t^{(l)}$ is the average feature obtained by aggregating the features of all directly connected neighbor nodes.
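As a minimal sketch, the update rule of formula (1) can be written directly in PyTorch; the function name and the choice of ReLU as the activation σ are illustrative:

```python
import torch

def gcn_layer(X, A, W, sigma=torch.relu):
    """One graph convolution layer as in formula (1).
    X: (N, C_in) node features at time t; A: (N, N) adjacency matrix;
    W: (C_in, C_out) weight matrix."""
    N = A.size(0)
    A_tilde = A + torch.eye(N)                  # add self-loops
    d = A_tilde.sum(dim=1)                      # node degrees of the self-looped graph
    D_inv_sqrt = torch.diag(d.pow(-0.5))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # symmetric normalization
    return sigma(A_hat @ X @ W)                 # aggregate neighbors, then project
```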
Illustratively, the embodiment first sets a sliding window of τ frames on the input graph sequence. At each time step the window covers a spatio-temporal subgraph $\mathcal{G}_{(\tau)} = (V_{(\tau)}, E_{(\tau)})$, where $V_{(\tau)} = V_1 \cup V_2 \cup \dots \cup V_\tau$ is the union of all nodes of the τ frames in the window; this yields the current local time-space domain. Within the τ frames, each node of $\mathcal{G}_{(\tau)}$ is closely connected to itself and to its direct spatial neighbors. Sliding the window over all frames of the input data then produces T windows, and the resulting features can be represented as $X_{(\tau)} \in \mathbb{R}^{T \times \tau N \times C}$, where C is the number of channels. The spatio-temporal graph convolution in the t-th time window can therefore be expressed as the following formula (2):

$$X_t^{(l+1),(\tau)} = \sigma\!\left(\tilde{D}_{(\tau)}^{-\frac{1}{2}}\, \tilde{A}_{(\tau)}\, \tilde{D}_{(\tau)}^{-\frac{1}{2}}\, X_t^{(l),(\tau)}\, W^{(l)}\right) \qquad (2)$$

where $X_t^{(l+1),(\tau)}$ and $X_t^{(l),(\tau)}$ denote the output and input feature maps of the t-th time window, $\tilde{A}_{(\tau)}$ represents the block adjacency matrix built up from the adjacency matrices of the τ frame skeletons, and $W^{(l)}$ is a weight matrix formed by stacking multiple weight vectors.
Further, after the local space-time domain is obtained through the space-time window construction, the joints in the multiple time frames are directly connected across space and time within the local space-time domain. First, the present embodiment groups the connections according to the spatial distance k between joints. For example, for the current joint, all joints at spatial distance 1 form one group and all joints at spatial distance 2 form another; that is, all joints with the same spatial distance are grouped together. Directly connected joints have a spatial distance of 1; when one joint lies between two joints, their spatial distance is 2, and so on. Grouping the connections by inter-joint spatial distance effectively alleviates the problem of the features of distant neighbor nodes being dominated by those of closer neighbor nodes and reduces redundant dependencies, and with the help of aggregating joint features over multiple different distances, spatio-temporal feature modeling remains effective even when the distance k is large. The specific calculation is shown in the following formula (3):
$$X_t^{(l+1)} = \sigma\!\left(\sum_{k=0}^{K} \tilde{D}_{(k)}^{-\frac{1}{2}}\, \tilde{A}_{(k)}\, \tilde{D}_{(k)}^{-\frac{1}{2}}\, X_t^{(l)}\, W_{(k)}^{(l)}\right) \qquad (3)$$

where K represents the predefined maximum spatial distance between joints within a single frame, $\tilde{A}_{(k)} = \mathbb{1}\big[\tilde{A}^{k} \ge 1\big] - \mathbb{1}\big[\tilde{A}^{k-1} \ge 1\big] + I$ keeps only the joint pairs whose spatial distance is exactly k, $\tilde{A}^{k}$ and $\tilde{A}^{k-1}$ denote the k-th and (k-1)-th powers of the adjacency matrix $\tilde{A}$, and $\tilde{D}_{(k)}$ is the normalized diagonal (degree) matrix of $\tilde{A}_{(k)}$.
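The distance-k grouping behind formula (3) can be sketched as follows; deriving the groups from binarized powers of the self-looped adjacency matrix is an assumption consistent with the description above:

```python
import torch

def k_adjacency(A, k, with_self=True):
    """Keep only joint pairs whose spatial distance is exactly k.
    A: (N, N) binary adjacency matrix of the skeleton graph."""
    I = torch.eye(A.size(0))
    if k == 0:
        return I
    A_tilde = A + I
    # Pairs reachable within k hops, minus pairs reachable within k-1 hops,
    # leaves exactly the pairs whose shortest distance is k.
    reach_k = (torch.matrix_power(A_tilde, k) >= 1).float()
    reach_k_minus_1 = (torch.matrix_power(A_tilde, k - 1) >= 1).float()
    H = reach_k - reach_k_minus_1
    return H + I if with_self else H
```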
Further, the present embodiment extends the direct connections between non-directly adjacent joints from the spatial dimension within a single frame to the spatio-temporal dimension constructed from multiple frames within a space-time window. Briefly, formula (2) and formula (3) are integrated, giving the following formula (4):
$$X_t^{(l+1),(\tau)} = \sigma\!\left(\sum_{k=0}^{K} \tilde{D}_{(\tau,k)}^{-\frac{1}{2}}\, \tilde{A}_{(\tau,k)}\, \tilde{D}_{(\tau,k)}^{-\frac{1}{2}}\, X_t^{(l),(\tau)}\, W_{(k)}^{(l)}\right) \qquad (4)$$

where K represents the predefined maximum cross-spatio-temporal distance between joints in the multi-frame spatio-temporal structure, $\tilde{A}_{(\tau,k)}$ is the block adjacency matrix that integrates the multi-frame, multi-distance adjacency matrices $\tilde{A}_{(k)}$ over the τ frames of the window, and $\tilde{D}_{(\tau,k)}$ is its normalized diagonal matrix. Thus, the present embodiment constructs the local convolution module (L-STGC module) by building direct cross-spatio-temporal jump connecting edges in the local spatio-temporal domain and performing spatio-temporal convolution on them; the local path built from the L-STGC module captures the local, complex spatio-temporal dependencies between joints.
In some embodiments of the present invention, constructing the local and non-local space-time graph convolution unit further includes, but is not limited to:
and constructing a non-local space-time graph convolution module.
And constructing a multi-scale time map convolution module.
And constructing a non-local path according to the non-local space-time graph convolution module and the multi-scale time graph convolution module. The input end of the non-local space-time graph convolution module is connected with the input end of the local and non-local space-time graph convolution unit, the output end of the non-local space-time graph convolution module is connected with the input end of the multi-scale time graph convolution module, and the output of the multi-scale time graph convolution module is used as the output of the non-local path.
In this specific embodiment, the non-local path of the LNL-STGC unit includes a non-local space-time graph convolution (NL-STGC) module and a multi-scale time graph convolution (MSTGC) module, and the non-local path is constructed from these two modules. Specifically, referring to fig. 3, the output end of the NL-STGC module is connected to the input end of the MSTGC module, and the output of the MSTGC module is taken as the output of the non-local path. By modeling sequentially through the NL-STGC module and the MSTGC module, this embodiment captures the non-local spatial dependencies, the multi-scale temporal motion features and the long-term spatio-temporal dependencies. The NL-STGC module extracts the dependencies between indirectly connected joints in the skeleton diagram, while the MSTGC module extracts the multi-scale temporal motion information of the skeleton diagram sequence, so that richer and latent spatio-temporal features between joints are extracted and the accuracy of human motion action recognition is improved.
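For illustration, the composition of the non-local path follows directly; the helper name is hypothetical, and `nl_stgc` and `mstgc` stand for instances of the modules sketched below:

```python
import torch.nn as nn

def make_nonlocal_path(nl_stgc: nn.Module, mstgc: nn.Module) -> nn.Module:
    # NL-STGC output feeds the MSTGC input; the MSTGC output is the path output.
    return nn.Sequential(nl_stgc, mstgc)
```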
In some embodiments of the invention, a non-local space-time graph convolution module is constructed, including but not limited to:
and constructing non-local space map features according to the skeleton map sequence.
And constructing a mask matrix.
And processing the mask matrix through the first activation function to obtain first mask data.
And constructing and obtaining a non-local space-time graph convolution module according to the first mask data and the non-local space graph characteristics.
In this embodiment, non-local spatial graph features are first constructed according to the skeleton graph sequence. A mask matrix is then constructed and processed through a first activation function to obtain first mask data, and the non-local space-time graph convolution (NL-STGC) module is built from the first mask data and the non-local spatial graph features. Specifically, since a fixed skeleton graph only contains direct connections between physically adjacent joints of the human body, it is difficult to extract the dependencies between non-directly adjacent joints; for a "running" action, for example, there is a strong correlation between the joints of the arm and leg, yet none of these joints are directly connected in the fixed skeleton graph. This embodiment therefore first constructs a non-local spatial graph from the input skeleton graph sequence by directly interconnecting all joints, so that the receptive field of each joint contains the features of all other joints. Directly connecting all joints, however, would give every inter-joint connection the same importance; to resolve this, the embodiment constructs a learnable mask matrix M that adaptively learns the important connections between joints while ignoring the interference of unimportant ones. Illustratively, the constructed non-local spatial graph features $\tilde{A}_{NL}$ are multiplied element-wise with the mask matrix processed by the first activation function (which may be the tanh activation function), yielding the adjusted graph characterized by the following formula (5):

$$\hat{A}_{NL} = \tilde{A}_{NL} \odot \tanh(M) \qquad (5)$$
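A sketch of the learnable mask of formula (5); the zero initialization of M and the module interface are illustrative choices:

```python
import torch
import torch.nn as nn

class NonLocalMask(nn.Module):
    """Element-wise mask over the fully interconnected (non-local) spatial
    graph: tanh(M) adaptively strengthens important joint connections and
    suppresses unimportant ones, as in formula (5)."""
    def __init__(self, num_joints):
        super().__init__()
        self.M = nn.Parameter(torch.zeros(num_joints, num_joints))  # learnable mask

    def forward(self, A_nl):
        # A_nl: (N, N) non-local spatial graph with all joints inter-connected
        return A_nl * torch.tanh(self.M)
```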
in some embodiments of the present invention, the multi-scale time map convolution module includes an expansion convolution sub-module, a convolution sub-module, and a pooling convolution sub-module, wherein input terminals of the expansion convolution sub-module, the convolution sub-module, and the pooling convolution sub-module are connected to an output terminal of the non-local space-time map convolution module, and the expansion convolution sub-module and the convolution sub-module are connected to an output terminal of the pooling convolution sub-module. Specifically, referring to fig. 4, in this particular embodiment, the MSTGC module includes an expansion convolution sub-module, a convolution sub-module, and a pooling convolution sub-module. The expansion convolution submodule, the convolution submodule and the pooling convolution submodule are connected in parallel, and the feature data output by the NL-STGC module are respectively input into the expansion convolution submodule, the convolution submodule and the pooling convolution submodule to be processed. The embodiment expands the receptive field without increasing the number of parameters by expanding the convolution sub-module. In addition, in this embodiment, the outputs of the dilation convolution submodule, the convolution submodule, and the pooling convolution submodule are connected, so as to extract and obtain information of different time scales. It is noted that in some embodiments of the present invention, the convolution submodule includes 1 × 1 convolution and the pooling convolution submodule includes 3 × 1 maximal pooling.
In some embodiments of the present invention, the expansion convolution submodule includes a convolution operation layer, a first shift operation layer, a point-by-point convolution layer and a second shift operation layer, connected in sequence. The expansion convolution sub-modules are 4 in number: a first, a second, a third and a fourth expansion convolution sub-module, whose point-by-point convolution layers have expansion rates of 1, 2, 3 and 4 respectively. Specifically, referring to fig. 4 and 6, the MSTGC module has 4 expansion convolution sub-modules. Each expansion convolution sub-module comprises a convolution operation layer, a first shift operation layer, a point-by-point convolution layer and a second shift operation layer connected in sequence, forming a shift-convolution-shift structure in which features are shifted in different specified directions, discarding some redundant features and reducing complexity and parameter quantity. The expansion rates of the point-by-point convolution layers differ across the sub-modules, being 1, 2, 3 and 4 for the first to fourth sub-modules respectively; the four expansion convolutions with different expansion rates obtain feature differences under different receptive fields, thereby extracting information of different time scales. It should be noted that, in some embodiments of the present invention, the convolution operation layer includes a 1 × 1 convolution and the point-by-point convolution layer includes a 3 × 1 point-by-point convolution. The MSTGC module also comprises a residual convolution module: the feature data output by the NL-STGC module is input into the residual convolution module, and the output of the residual convolution module is fused with the concatenated output of the expansion convolution sub-modules, the convolution sub-module and the pooling convolution sub-module to obtain the output of the MSTGC module, so that information of different time scales can be extracted simultaneously.
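Putting these pieces together, the MSTGC module might be sketched as below; the shift directions, the equal six-way channel split (assuming the channel count is divisible by 6) and the residual 1 × 1 convolution follow the description above, with details that the text leaves open filled in as assumptions:

```python
import torch
import torch.nn as nn

class ShiftTemporal(nn.Module):
    """Illustrative shift operation: moves channel groups by one frame in
    opposite directions along the time axis (the exact shift layout is an
    assumption)."""
    def forward(self, x):                          # x: (B, C, T, V)
        out = torch.zeros_like(x)
        c = x.size(1) // 3
        out[:, :c, 1:] = x[:, :c, :-1]             # one group shifted forward
        out[:, c:2 * c, :-1] = x[:, c:2 * c, 1:]   # one group shifted backward
        out[:, 2 * c:] = x[:, 2 * c:]              # remaining channels unshifted
        return out

class ExpansionBranch(nn.Module):
    """One expansion convolution sub-module: 1x1 conv -> shift -> 3x1 conv
    with expansion (dilation) rate d -> shift."""
    def __init__(self, in_c, out_c, d):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_c, out_c, kernel_size=1),        # convolution operation layer
            ShiftTemporal(),                              # first shift operation layer
            nn.Conv2d(out_c, out_c, kernel_size=(3, 1),
                      padding=(d, 0), dilation=(d, 1)),   # 3x1 point-by-point conv, rate d
            ShiftTemporal(),                              # second shift operation layer
        )

    def forward(self, x):
        return self.body(x)

class MSTGC(nn.Module):
    """Multi-scale time graph convolution: four expansion branches
    (rates 1..4), a 1x1 convolution branch and a 3x1 max-pooling branch,
    concatenated and fused with a residual 1x1 convolution."""
    def __init__(self, channels):
        super().__init__()
        assert channels % 6 == 0                  # six equal-width branches
        bc = channels // 6
        self.expansion = nn.ModuleList(
            [ExpansionBranch(channels, bc, d) for d in (1, 2, 3, 4)])
        self.conv1x1 = nn.Conv2d(channels, bc, kernel_size=1)
        self.pool = nn.Sequential(
            nn.Conv2d(channels, bc, kernel_size=1),
            nn.MaxPool2d(kernel_size=(3, 1), stride=1, padding=(1, 0)))
        self.residual = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                         # x: (B, C, T, V) from NL-STGC
        branches = [b(x) for b in self.expansion]
        branches.append(self.conv1x1(x))
        branches.append(self.pool(x))
        return torch.cat(branches, dim=1) + self.residual(x)  # fuse with residual
```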
In some embodiments of the present invention, the local and non-local space-time graph convolution units include a first, a second and a third local and non-local space-time graph convolution unit. The first unit has 96 feature channels and a time convolution and space-time window stride of 1; its output end is sequentially connected with a first batch normalization module and a second activation function, whose output end is connected with the input end of the second unit. The second unit has 192 feature channels and a time convolution and space-time window stride of 2; its output end is sequentially connected with a second batch normalization module and a third activation function, whose output end is connected with the input end of the third unit. The third unit has 384 feature channels and a time convolution and space-time window stride of 2, and its output end is connected with the input end of the global average pooling layer. Specifically, referring to fig. 5, the human motion recognition model stacks three LNL-STGC units: the first, second and third local and non-local space-time graph convolution units are sequentially connected to extract spatio-temporal features from the skeleton diagram sequence, with 96, 192 and 384 feature channels respectively. Except for the third unit, batch normalization and an activation function are arranged at the output end of each unit, i.e., a first batch normalization module and a second activation function at the output end of the first unit, and a second batch normalization module and a third activation function at the output end of the second unit. The second activation function and the third activation function may both be ReLU activation functions.
Meanwhile, while the first local and non-local space-time graph convolution unit uses a time convolution and space-time window with stride 1, the second and third local and non-local space-time graph convolution units both use a time convolution and space-time window with stride 2 to down-sample in time, so that the high-level features have a larger receptive field, which facilitates the extraction of richer action features.
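Under those choices, the three-unit stack can be sketched as follows; `LNLSTGCUnit(in_c, out_c, stride)` is assumed to be a convenience constructor wrapping the two paths of the earlier sketches with the given channel counts and temporal stride, and is stubbed with a plain strided convolution here so the sketch runs end-to-end:

```python
import torch.nn as nn

# Stub standing in for the unit sketched earlier, so this file is runnable.
def LNLSTGCUnit(in_c, out_c, stride):
    return nn.Conv2d(in_c, out_c, kernel_size=(3, 1),
                     stride=(stride, 1), padding=(1, 0))

class LNLSTGCNet(nn.Module):
    """Three stacked LNL-STGC units with 96, 192 and 384 feature channels,
    batch normalization and ReLU after the first two units only, followed by
    global average pooling, a fully connected layer and Softmax."""
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.stage1 = nn.Sequential(LNLSTGCUnit(in_channels, 96, stride=1),
                                    nn.BatchNorm2d(96), nn.ReLU())
        self.stage2 = nn.Sequential(LNLSTGCUnit(96, 192, stride=2),
                                    nn.BatchNorm2d(192), nn.ReLU())
        self.stage3 = LNLSTGCUnit(192, 384, stride=2)  # no BN/ReLU after the last unit
        self.fc = nn.Linear(384, num_classes)          # fully connected layer

    def forward(self, x):                              # x: (B, C, T, N) skeleton sequence
        x = self.stage3(self.stage2(self.stage1(x)))   # first space-time features
        x = x.mean(dim=(2, 3))                         # global average pooling
        return self.fc(x).softmax(dim=-1)              # Softmax classifier
```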
An embodiment of the present invention further provides a human motion recognition system, including:
and the space-time graph convolution unit construction module is used for constructing a local and non-local space-time graph convolution unit. The local and non-local space-time convolution units comprise local paths and non-local paths, the input ends of the local paths and the non-local paths are connected with the input ends of the local and non-local space-time graph convolution units, and the output of the local paths and the output of the non-local paths are aggregated to be used as the output of the local and non-local space-time graph convolution units.
And the characteristic extraction module is used for inputting the skeleton diagram sequence into the local and non-local space-time diagram convolution unit to extract the first space-time characteristic.
And the pooling module is used for inputting the first time-space characteristic into the global average pooling layer to obtain a second time-space characteristic.
And the prediction output module is used for sequentially inputting the second time-space characteristics into the full-link layer and the classifier to predict to obtain an identification result.
Referring to fig. 2, an embodiment of the present invention further provides a human motion action recognition system, including:
at least one process 210.
At least one memory 220 for storing at least one program.
When the at least one program is executed by the at least one processor 210, the at least one processor implements the human motion action recognition method as described in the above embodiments.
An embodiment of the present invention also provides a computer-readable storage medium storing computer-executable instructions for execution by one or more control processors, e.g., to perform the steps described in the above embodiments.
One of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.
While the preferred embodiments of the present invention have been described in detail, it will be understood by those skilled in the art that the foregoing and various other changes, omissions and deviations in the form and detail thereof may be made without departing from the scope of this invention.

Claims (10)

1. A human motion action recognition method is characterized by comprising the following steps:
constructing a local and non-local space-time graph convolution unit; the local and non-local space-time convolution units comprise local paths and non-local paths, the input ends of the local paths and the non-local paths are connected with the input ends of the local and non-local space-time graph convolution units, and the output of the local paths and the output of the non-local paths are aggregated to be used as the output of the local and non-local space-time graph convolution units;
inputting the skeleton diagram sequence into the local and non-local space-time diagram convolution unit to extract a first space-time characteristic;
inputting the first space-time characteristics into a global average pooling layer to obtain second space-time characteristics;
and inputting the second space-time characteristics into a fully connected layer and a classifier in sequence to predict a recognition result.
2. The human motion action recognition method of claim 1, wherein the constructing of the local and non-local space-time graph convolution unit comprises:
constructing a local time-space domain;
constructing a cross-space-time jump connecting edge according to the local space-time domain;
performing space-time convolution construction on the cross-space-time jump connection edge to obtain a local convolution module;
and constructing the local path according to the local convolution module.
3. The human motion action recognition method of claim 2, wherein the constructing of the local and non-local space-time graph convolution unit further comprises:
constructing a non-local space-time graph convolution module;
constructing a multi-scale time graph convolution module;
constructing the non-local path according to the non-local space-time graph convolution module and the multi-scale time graph convolution module; the input end of the non-local space-time graph convolution module is connected with the input end of the local and non-local space-time graph convolution unit, the output end of the non-local space-time graph convolution module is connected with the input end of the multi-scale time graph convolution module, and the output of the multi-scale time graph convolution module is used as the output of the non-local path.
4. The human motion action recognition method of claim 3, wherein the constructing of the non-local space-time graph convolution module comprises:
constructing non-local space map features according to the skeleton map sequence;
constructing a mask matrix;
processing the mask matrix through a first activation function to obtain first mask data;
and constructing and obtaining the non-local space-time graph convolution module according to the first mask data and the non-local space graph characteristics.
5. The human motion action recognition method according to claim 3, wherein the multi-scale time graph convolution module comprises an expansion convolution sub-module, a convolution sub-module and a pooling convolution sub-module, wherein the input ends of the expansion convolution sub-module, the convolution sub-module and the pooling convolution sub-module are connected with the output end of the non-local space-time graph convolution module, and the output ends of the expansion convolution sub-module, the convolution sub-module and the pooling convolution sub-module are connected.
6. The human motion action recognition method according to claim 5, wherein the expansion convolution sub-module includes a convolution operation layer, a first shift operation layer, a point-by-point convolution layer, and a second shift operation layer, and the convolution operation layer, the first shift operation layer, the point-by-point convolution layer, and the second shift operation layer are connected in sequence; the expansion convolution sub-modules are 4 in number and comprise a first expansion convolution sub-module, a second expansion convolution sub-module, a third expansion convolution sub-module and a fourth expansion convolution sub-module, the point-by-point convolution layer expansion rate of the first expansion convolution sub-module is 1, the point-by-point convolution layer expansion rate of the second expansion convolution sub-module is 2, the point-by-point convolution layer expansion rate of the third expansion convolution sub-module is 3, and the point-by-point convolution layer expansion rate of the fourth expansion convolution sub-module is 4.
7. The human motion action recognition method according to claim 6, wherein the local and non-local space-time graph convolution units comprise a first local and non-local space-time graph convolution unit, a second local and non-local space-time graph convolution unit and a third local and non-local space-time graph convolution unit, the first local and non-local space-time graph convolution unit has 96 feature channels and a time convolution and space-time window stride of 1, the output end of the first local and non-local space-time graph convolution unit is sequentially connected with a first batch normalization module and a second activation function, the output end of the second activation function is connected with the input end of the second local and non-local space-time graph convolution unit, the second local and non-local space-time graph convolution unit has 192 feature channels and a time convolution and space-time window stride of 2, the output end of the second local and non-local space-time graph convolution unit is sequentially connected with a second batch normalization module and a third activation function, the output end of the third activation function is connected with the input end of the third local and non-local space-time graph convolution unit, the third local and non-local space-time graph convolution unit has 384 feature channels and a time convolution and space-time window stride of 2, and the output end of the third local and non-local space-time graph convolution unit is connected with the input end of the global average pooling layer.
8. A human motion action recognition system, comprising:
the space-time graph convolution unit construction module is used for constructing a local and non-local space-time graph convolution unit, wherein the local and non-local space-time graph convolution unit comprises a local path and a non-local path, the input ends of the local path and the non-local path are connected with the input end of the local and non-local space-time graph convolution unit, and the outputs of the local path and the non-local path are aggregated as the output of the local and non-local space-time graph convolution unit;
the feature extraction module is used for inputting the skeleton graph sequence into the local and non-local space-time graph convolution unit to extract a first space-time feature;
the pooling module is used for inputting the first space-time feature into a global average pooling layer to obtain a second space-time feature; and
the prediction output module is used for inputting the second space-time feature into the fully-connected layer and the classifier in sequence to predict the recognition result.
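
A minimal sketch of the prediction path in the claimed system: the pooled second space-time feature passes through a fully-connected layer and a softmax classifier. The 384-dimensional input matches the third unit of claim 7, while num_classes=60 is an assumed parameter.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    def __init__(self, feat_dim=384, num_classes=60):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)  # fully-connected layer

    def forward(self, second_feature):
        # second_feature: (N, C, 1, 1) output of the global average pooling
        logits = self.fc(second_feature.flatten(1))
        return logits.softmax(dim=-1)  # classifier: class probabilities

# e.g. PredictionHead()(torch.randn(8, 384, 1, 1)) -> probabilities of shape (8, 60)
```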
9. A human motion action recognition system, comprising:
at least one processor;
at least one memory for storing at least one program;
wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the human motion action recognition method according to any one of claims 1 to 7.
10. A computer storage medium storing a processor-executable program, wherein the processor-executable program, when executed by a processor, implements the human motion action recognition method according to any one of claims 1 to 7.
CN202210580454.5A 2022-05-26 2022-05-26 Human motion action recognition method, system and storage medium Pending CN114898467A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210580454.5A CN114898467A (en) 2022-05-26 2022-05-26 Human motion action recognition method, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210580454.5A CN114898467A (en) 2022-05-26 2022-05-26 Human motion action recognition method, system and storage medium

Publications (1)

Publication Number Publication Date
CN114898467A true CN114898467A (en) 2022-08-12

Family

ID=82726317

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210580454.5A Pending CN114898467A (en) 2022-05-26 2022-05-26 Human motion action recognition method, system and storage medium

Country Status (1)

Country Link
CN (1) CN114898467A (en)

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
KE CHENG et al.: "Skeleton-Based Action Recognition with Shift Graph Convolutional Network", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 31 December 2020 (2020-12-31), pages 183-192 *
YULING XING et al.: "An improved spatial temporal graph convolutional network for robust skeleton-based action recognition", Applied Intelligence, vol. 53, no. 4, 13 June 2022 (2022-06-13), pages 4592-4608 *
ZIYU LIU et al.: "Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 31 December 2020 (2020-12-31), pages 143-152 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115083022A (en) * 2022-08-22 2022-09-20 深圳比特微电子科技有限公司 Pet behavior identification method and device and readable storage medium
CN115273244A (en) * 2022-09-29 2022-11-01 合肥工业大学 Human body action recognition method and system based on graph neural network
CN115273244B (en) * 2022-09-29 2022-12-20 合肥工业大学 Human body action recognition method and system based on graph neural network

Similar Documents

Publication Publication Date Title
Choy et al. 4d spatio-temporal convnets: Minkowski convolutional neural networks
CN110796110B (en) Human behavior identification method and system based on graph convolution network
CN111310707B (en) Skeleton-based graph attention network action recognition method and system
Plastiras et al. Edge intelligence: Challenges and opportunities of near-sensor machine learning applications
CN114898467A (en) Human motion action recognition method, system and storage medium
CN112990211B (en) Training method, image processing method and device for neural network
CN111985343A (en) Method for constructing behavior recognition deep network model and behavior recognition method
Li et al. Test-time personalization with a transformer for human pose estimation
US11495055B1 (en) Pedestrian trajectory prediction method and system based on multi-interaction spatiotemporal graph network
CN108510058B (en) Weight storage method in neural network and processor based on method
Li et al. Skeleton graph scattering networks for 3d skeleton-based human motion prediction
EP4099213A1 (en) A method for training a convolutional neural network to deliver an identifier of a person visible on an image, using a graph convolutional neural network
CN113780584B (en) Label prediction method, label prediction device, and storage medium
CN110163052B (en) Video action recognition method and device and machine equipment
CN115546888A (en) Symmetric semantic graph convolution pose estimation method based on body part grouping
CN112906853A (en) Method, device, equipment and storage medium for automatic model optimization
CN116596109A (en) Traffic flow prediction model based on gating time convolution network
Puertas et al. Generalized multi-scale stacked sequential learning for multi-class classification
CN116129310A (en) Video target segmentation system, method, electronic equipment and medium
CN117314965A (en) Space-time fusion multi-target tracking method, device, equipment and medium
CN117095460A (en) Self-supervision group behavior recognition method and system based on long-short time relation predictive coding
CN116994093A (en) Hyperspectral image classification method based on dual-graph convolution neural network
Wang et al. Exploring fine-grained sparsity in convolutional neural networks for efficient inference
CN115426671A (en) Method, system and equipment for graph neural network training and wireless cell fault prediction
CN115908497A (en) Three-dimensional human body posture estimation method and system based on human body topology sensing network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination