LU101933B1 - Human action recognition method, human action recognition system and equipment - Google Patents

Human action recognition method, human action recognition system and equipment

Info

Publication number
LU101933B1
Authority
LU
Luxembourg
Prior art keywords
human
keypoint
graph
action recognition
human action
Prior art date
Application number
LU101933A
Other languages
French (fr)
Other versions
LU101933A1 (en)
Inventor
Cheng Yang
Hong Zhou
Original Assignee
Univ Zhejiang
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Univ Zhejiang filed Critical Univ Zhejiang
Priority to LU101933A priority Critical patent/LU101933B1/en
Publication of LU101933A1 publication Critical patent/LU101933A1/en
Application granted granted Critical
Publication of LU101933B1 publication Critical patent/LU101933B1/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/03 Recognition of patterns in medical or anatomical images
    • G06V2201/033 Recognition of patterns in medical or anatomical images of skeletal patterns

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

This invention provides a human action recognition method, a human action recognition system and equipment. The human action recognition method includes: obtaining a video including human action, and resampling and preprocessing a video frame; extracting an image feature of the video frame; obtaining a human keypoint sequence corresponding to the video frame based on human skeletal information; and inputting the image feature and the human keypoint sequence into a graph convolutional neural network to obtain an action category. By constructing a model from the human keypoint sequence and the image feature containing environmental information, robustness against changes in the environment is ensured and the environmental information is fully utilized for the human action recognition, so the accuracy is high.

Description

HUMAN ACTION RECOGNITION METHOD, HUMAN ACTION RECOGNITION SYSTEM AND EQUIPMENT
TECHNICAL FIELD
The present application relates to the technical field of image processing, and more particularly, to a human action recognition method, a human action recognition system and equipment.
BACKGROUND
At present, action recognition methods in the field of human recognition mainly include human action recognition based on an RGB image and human action recognition based on a human skeleton.
Among them, in the human action recognition based on the RGB image, an RGB image sequence is taken as the input. The most effective methods usually use a convolutional neural network for end-to-end training and learning, that is, feature extraction and action classification are completed at the same time. The advantage is that all features in the environment are extracted, which provides more complete information for achieving accurate human action recognition. However, because all of the features in the environment are extracted, it is difficult to extract the features of the human itself, so the accuracy of the human action recognition is greatly affected by the environment, for example by changes in illumination, occlusion and other factors, resulting in a lack of robustness.
In the action learning based on the human skeleton, a human keypoint sequence is taken as the input, which only includes abstract information such as the 2D or 3D coordinates of the human keypoints. In this way, the influence of environmental noise is reduced and a more robust action algorithm can be built; however, the accuracy of the human action recognition that requires environmental information is low due to the absence of that information. An action cannot be completely defined by the movement of the human alone: in the real world, the same action performed in different environments may have different meanings. In order to achieve accurate human action recognition, environmental information is required as auxiliary information, whereas all environmental information is lacking in the human action recognition based on the human skeleton.
SUMMARY
To overcome the shortcomings of the prior art, this invention provides a human action recognition method, a human action recognition system, equipment, and a readable storage medium. A model based mainly on the human action recognition using the human skeleton is built, and environmental information is encoded into the model in a proper way, so as to achieve robustness against changes in the environment while fully utilizing the environmental information for the human action recognition.
In order to achieve the above objects, an embodiment of this invention provides a human action recognition method including: obtaining a video including human action, and resampling and preprocessing a video frame; extracting an image feature of the video frame; obtaining a human keypoint sequence corresponding to the video frame based on human skeletal information; and inputting the image feature and the human keypoint sequence into a graph convolutional neural network to obtain an action category.
Optionally, specific steps of obtaining the action category include: obtaining a first vector, wherein the first vector represents the image feature of the video frame; using the human keypoint sequence to construct a human keypoint graph; inputting the human keypoint graph into the graph convolutional neural network to generate a second vector; connecting the first vector and the second vector and inputting them into a fully connected layer to generate a third vector; and inputting the third vector into a classifier to obtain the predicted action category.
Optionally, specific steps of using the human keypoint sequence to construct the human keypoint graph include: recording the constructed human keypoint graph as G=(V,E), where V is a set of vertices of the graph, V = {Vti | t = 1,...,T, i = 1,...,N}, T is the number of skeletal sequences, N is the number of keypoints detected in one picture, and Vti is the i-th keypoint in the t-th picture; E represents edges of the graph, which is composed of two parts: a connection state E1 of keypoints in one frame and a connection state E2 of the keypoints in different frames, wherein E1 is a physical connection state of different keypoints in one frame and E2 is a virtual physical connection of the same keypoint in different frames defined for facilitating subsequent capture of a temporal feature; and in a process of implementation, using an N*N adjacency matrix A to describe the connection state of the keypoints in one frame, wherein Aij is 1 if there is a physical connection between a keypoint i and a keypoint j, otherwise it is 0.
Optionally, specific steps of generating the second vector include: stacking graph convolutional layers to form the graph convolutional neural network and performing the same operation in each graph convolutional layer; in each graph convolution layer, performing operation in two different dimensions, wherein one is to perform a graph convolution in a spatial dimension, and the other is to perform an ordinary convolution in a temporal dimension; and transforming an output of a graph neural network module to obtain the second vector.
Optionally, specific steps of performing the graph convolution in the spatial dimension include: in the spatial dimension, performing the graph convolution on each frame of the human keypoint graph to capture the connection between different keypoints, with the specific implementation satisfying the following formula: Xout = D^(-1/2) (I + A) D^(-1/2) Xin W, wherein in the above formula, I is an identity matrix, A is an adjacency matrix, D is a degree matrix with Dii = Σj (Aij + Iij), Xin is an input which is an N*U tensor, and W is a weight parameter of the graph convolution layer for transforming the feature.
Optionally, specific steps of performing the ordinary convolution in the temporal dimension include: in the temporal dimension, performing the ordinary convolution on the same keypoint between adjacent frames to capture change in each keypoint over time.
Optionally, specific steps of obtaining the first vector include: selecting a plurality of pictures from the video frames and respectively inputting the pictures into a ResNet-50 pre-trained on an Imagenet database, using an output of a last fully connected layer as the image feature to obtain a plurality of initial vectors, and averaging the initial vectors to obtain the first vector.
The embodiment of this invention further provides a human action recognition system including: a video frame obtaining module, configured to obtain a video including human action, and resample and preprocess a video frame; an image feature extracting module, configured to extract an image feature of the video frame; a human keypoint sequence extracting module, configured to obtain a human keypoint sequence corresponding to the video frame based on human skeletal information; and an action category obtaining module, configured to input the image feature and the human keypoint sequence into a deep neural network to obtain an action category.
The embodiment of this invention further provides human action recognition equipment, including: a memory, a processor, and a human action recognition program stored in the memory and run on the processor, wherein the steps of the above human action recognition method are implemented when the human action recognition program is run by the processor.
The embodiment of this invention further provides a computer readable storage medium, wherein a human action recognition program is stored in the computer readable storage medium, and the steps of the above human action recognition method are implemented when the human action recognition program is run by a processor.
Beneficial effects of this invention are as follows. By extracting the human keypoint sequence and the image feature of the video frame and inputting them into the graph convolutional neural network to predict the action category, and by constructing a model from the human keypoint sequence and the image feature containing the environmental information, robustness against changes in the environment is ensured and the environmental information is fully utilized for the human action recognition, so the accuracy is high.
This invention will be further described in detail below in connection with the drawings and embodiments, for the purpose of better understanding the objects, technical solutions, and beneficial effects of this invention.
BRIEF DESCRIPTION OF DRAWINGS
FIG. 1 is a schematic flowchart of a human action recognition method according to one embodiment of this invention;
FIG. 2 is a schematic flowchart of a specific method for obtaining an action category according to one embodiment of the present invention;
FIG. 3 is a constructed human keypoint graph according to one embodiment of this invention; and
FIG. 4 is a structural block diagram of a human action recognition system provided by one embodiment of this invention.
DETAILED DESCRIPTION
In human action recognition based on an RGB image in the prior art, it is difficult to extract features of the human itself because all of the features in the environment are extracted; thus the accuracy of the human action recognition is greatly affected by the environment, resulting in a lack of robustness.
In action learning based on a human skeleton, although a more robust action algorithm can be built, the accuracy of the human action recognition that requires environmental information is low due to the absence of the environmental information.
Therefore, according to the embodiment of this invention, a model mainly for the human action recognition based on the human skeleton is built, and environmental information is encoded into the model in a proper way, so as to achieve robustness against changes in the environment and to fully utilize the environmental information for the human action recognition.
FIG. 1 is a schematic flowchart of a human action recognition method according to an embodiment of this invention, specifically including: S10, obtaining a video including human action, and resampling and preprocessing a video frame; S20, extracting an image feature of the video frame; S30, obtaining a human keypoint sequence corresponding to the video frame based on human skeletal information; and S40, inputting the image feature and the human keypoint sequence into a graph convolutional neural network to obtain an action category.
Specifically, S10 is performed first to resample and preprocess the video frames of the video including the human action.
In this embodiment, for a video including the human action, an opencv library is used to sample the video at 25 frames per second, and a video frame image sequence is arranged in chronological order.
At the same time, the opencv library is used to preprocess and scale all images so that the resolution of each image is 224*224.
In other embodiments, other vision libraries may also be adopted to sample the video, and the resolution may also be set to other values.
Performing S20 is to extract the image feature of the video frame for n pictures randomly selected from the video frame image sequence obtained in step S10. In this embodiment, 3 pictures are selected for the extraction of the image feature; in other embodiments, 4, 5, 6 or more pictures may be selected. However, it is not necessary to select a large number of pictures, as this leads to excessive computation. Therefore, in this embodiment, 3 pictures are selected for the extraction.
In this embodiment, extracting the image feature of the video frame specifically includes the following steps: respectively inputting the 3 pictures into a ResNet-50 pre-trained on the ImageNet database, using the output of the last fully connected layer as the image feature to obtain three 2048-dimensional vectors, and averaging the three vectors to obtain one 2048-dimensional vector, which is recorded as an environmental vector Xcon.
In this embodiment, ResNet-50 is adopted to extract the image feature; in other embodiments, other conventional methods such as the SIFT algorithm may also be adopted, and other pre-trained deep models such as VGG-19, ResNet-152, etc. may also be adopted to extract the feature.
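For illustration only, the following is a minimal sketch of the frame resampling (S10) and image feature extraction (S20) described above, assuming Python with OpenCV, PyTorch and torchvision; the patent does not prescribe a framework, and the function names, normalization constants and use of the 2048-dimensional pooled feature (which matches the dimensionality stated in the text) are assumptions of this sketch.

```python
import random

import cv2
import torch
import torchvision.models as models
import torchvision.transforms as T

def sample_and_preprocess(video_path, target_fps=25, size=224):
    """S10: resample the video at about 25 frames per second and scale frames to 224*224."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(int(round(native_fps / target_fps)), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.resize(frame, (size, size)))
        idx += 1
    cap.release()
    return frames  # list of 224*224*3 BGR images in chronological order

def extract_environment_vector(frames, n_pictures=3):
    """S20: pass n randomly selected frames through a pre-trained ResNet-50 and average."""
    backbone = models.resnet50(pretrained=True)
    backbone.fc = torch.nn.Identity()  # expose the 2048-d feature feeding the final layer
    backbone.eval()
    to_tensor = T.Compose([T.ToTensor(),
                           T.Normalize(mean=[0.485, 0.456, 0.406],
                                       std=[0.229, 0.224, 0.225])])
    picks = random.sample(frames, n_pictures)
    with torch.no_grad():
        feats = [backbone(to_tensor(cv2.cvtColor(f, cv2.COLOR_BGR2RGB)).unsqueeze(0))
                 for f in picks]
    return torch.stack(feats).mean(dim=0).squeeze(0)  # 2048-d environmental vector Xcon
```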
The environmental vector Xcon is input into an encoder composed of a two-layer fully connected network, and a K-dimensional first vector Xc is output.
There may be a very large spatial mismatch between the extracted image features and the human skeleton features extracted by the graph convolutional network, which makes the subsequent feature fusion difficult to learn. Since it is difficult to accurately map features from two different spaces into one space, a learnable encoder is used to learn from the data how to map the features extracted by the two different networks into one latent space.
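A minimal sketch of this learnable encoder, assuming PyTorch; the patent only specifies a two-layer fully connected structure mapping the 2048-dimensional Xcon to a K-dimensional Xc, so the hidden size, K value and activation are assumptions.

```python
import torch.nn as nn

class EnvironmentEncoder(nn.Module):
    """Two-layer fully connected encoder: 2048-d environmental vector Xcon -> K-d first vector Xc."""
    def __init__(self, in_dim=2048, hidden_dim=512, k_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, k_dim),
        )

    def forward(self, x_con):        # x_con: (batch, 2048)
        return self.net(x_con)       # Xc: (batch, K)
```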
Performing S30 is to obtain the human keypoint sequence corresponding to the video frame based on the human skeletal information.
In this embodiment, the openpose algorithm is used to obtain the human keypoint sequence from the selected pictures, and 15 keypoints are detected in each picture. The openpose algorithm, an open-source human pose estimation algorithm proposed by Carnegie Mellon University, is used to detect the human keypoints and output their 2D or 3D coordinates.
In other embodiments, other algorithms may also be adopted to obtain the human keypoint sequence (a minimal sketch of this step is given below). Among them, S20 and S30 are in no particular order and can be interchanged.
Performing S40 is to input the image feature and the human keypoint sequence into the graph convolutional neural network to obtain the action category.
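As referenced above, the following sketch shows how the keypoint sequence of S30 could be assembled into a T*N*2 array; `estimate_keypoints` is a hypothetical wrapper around a pose estimator such as openpose and is not an actual library call, and the 2D coordinate output is an assumption of this sketch.

```python
import numpy as np

def build_keypoint_sequence(frames, estimate_keypoints, num_keypoints=15):
    """S30: run a pose estimator on each selected frame and stack the results.

    estimate_keypoints(frame) is assumed to return an (N, 2) array of (x, y)
    coordinates for the N detected keypoints (N = 15 in this embodiment).
    """
    sequence = []
    for frame in frames:
        kpts = estimate_keypoints(frame)     # (N, 2) keypoint coordinates for one frame
        assert kpts.shape == (num_keypoints, 2)
        sequence.append(kpts)
    return np.stack(sequence, axis=0)        # (T, N, 2) human keypoint sequence
```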
Referring to FIG. 2, obtaining the action category specifically includes the following steps.
S41, obtaining a first vector, wherein the first vector represents the image feature of the video frame.
The environmental vector Xcon is input into the encoder composed of the two-layer fully connected network described above, and the K-dimensional first vector Xc is output.
As noted above, there may be a very large spatial mismatch between the extracted image features and the human skeleton features extracted by the graph convolutional network, which makes the subsequent feature fusion difficult to learn; the learnable encoder is therefore used to map the features extracted by the two different networks into one latent space. Obtaining the first vector and obtaining the second vector are in no particular order, and they may be performed at the same time or in any order.
S42, using the human keypoint sequence to construct the human keypoint graph.
Please refer to FIG. 3, which shows the constructed human keypoint graph. The constructed human keypoint graph is recorded as G=(V,E), where V is the set of vertices of the graph, V = {Vti | t = 1,...,T, i = 1,...,N}, T is the number of skeletal sequences, N is the number of keypoints detected in one picture, and Vti is the i-th keypoint in the t-th picture; E represents the edges of the graph, which is composed of two parts: a connection state E1 of the keypoints in one frame and a connection state E2 of the keypoints in different frames, wherein E1 is the physical connection state of different keypoints in one frame and E2 is a virtual physical connection of the same keypoint in different frames defined for facilitating subsequent capture of a temporal feature. In the implementation, an N*N adjacency matrix A is used to describe the connection state, and Aij is 1 if there is a physical connection between keypoint i and keypoint j, otherwise it is 0.
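A minimal sketch of the intra-frame adjacency matrix A described in S42; the 15-keypoint edge list below is purely illustrative, since the patent does not enumerate which keypoints are physically connected.

```python
import numpy as np

# Hypothetical edge list for a 15-keypoint skeleton (pairs of physically connected joints).
SKELETON_EDGES = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7),
                  (1, 8), (8, 9), (9, 10), (8, 11), (11, 12), (12, 13), (0, 14)]

def build_adjacency(num_keypoints=15, edges=SKELETON_EDGES):
    """Aij = 1 if keypoint i and keypoint j are physically connected within one frame, else 0."""
    A = np.zeros((num_keypoints, num_keypoints), dtype=np.float32)
    for i, j in edges:
        A[i, j] = 1.0
        A[j, i] = 1.0   # the skeleton graph is undirected
    return A
```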
S43, inputting the human keypoint graph into the graph convolutional neural network to generate the second vector. The graph convolutional neural network is formed by stacking graph convolutional layers, and
the same operation is performed in each graph convolutional layer. In each graph convolution layer, operation is performed in two different dimensions: one is a graph convolution in the spatial dimension, and the other is an ordinary convolution in the temporal dimension.
Among them, specific steps of performing the graph convolution in the spatial dimension include: in the spatial dimension, performing the graph convolution on each frame of the human keypoint graph to capture the connection between different keypoints, with the specific implementation satisfying the following formula:
Xout = D^(-1/2) (I + A) D^(-1/2) Xin W
In the above formula, I is an identity matrix, that is, a matrix in which the diagonal elements are 1 and the other elements are 0; it serves as a self-connection matrix here, meaning that each vertex is connected to itself. A is the adjacency matrix which represents the connection state: Aij is 1 if there is a physical connection between keypoint i and keypoint j, otherwise it is 0. D is the degree matrix, which describes how many edges are connected to each vertex; only its diagonal elements are nonzero, and Dii = Σj (Aij + Iij). Xin is the input, which is an N*U tensor. W is a weight parameter of the graph convolution layer for transforming the feature.
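A minimal PyTorch sketch of this spatial graph convolution; the normalization follows the formula Xout = D^(-1/2) (I + A) D^(-1/2) Xin W with Dii = Σj (Aij + Iij), while the (batch, T, N, features) tensor layout and the use of PyTorch are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class SpatialGraphConv(nn.Module):
    """Per-frame graph convolution: Xout = D^(-1/2) (I + A) D^(-1/2) Xin W."""
    def __init__(self, in_features, out_features, A):
        super().__init__()
        A = torch.as_tensor(A, dtype=torch.float32)
        I = torch.eye(A.size(0))
        D = torch.diag((A + I).sum(dim=1).pow(-0.5))      # D^(-1/2), with Dii = sum_j (Aij + Iij)
        self.register_buffer("A_norm", D @ (A + I) @ D)   # normalized (N, N) adjacency
        self.W = nn.Linear(in_features, out_features, bias=False)

    def forward(self, x):          # x: (batch, T, N, in_features)
        x = torch.einsum("nm,btmf->btnf", self.A_norm, x)  # aggregate over connected keypoints
        return self.W(x)           # feature transform: (batch, T, N, out_features)
```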
Among them, specific steps of performing the ordinary convolution operation in the temporal dimension include: in the temporal dimension, performing the ordinary convolution on the same keypoint between adjacent frames to capture the change in each keypoint over time.
The reason for performing different convolutions in the two dimensions is that the main purpose of the convolution in the spatial dimension is to capture the connection between different keypoints, while the purpose of the convolution in the temporal dimension is to capture the movement of each keypoint over time. Two different convolutions are performed in the two dimensions because an action is a dynamic process composed of the relationships between keypoints in space and the changes of each keypoint over time. In addition, different convolutions are used because they have different inputs: the input of the convolution in the spatial dimension is the set of different keypoints at the same time point, which are connected in the form of a graph, so a graph convolution is used; the input in the temporal dimension is the same keypoint at different time points, so an ordinary convolution is used.
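A minimal sketch of the ordinary temporal convolution described above, applied independently to each keypoint across adjacent frames; the kernel size, channel count and tensor layout are assumptions, as the patent does not specify them.

```python
import torch.nn as nn

class TemporalConv(nn.Module):
    """Ordinary convolution over time, applied to the same keypoint across adjacent frames."""
    def __init__(self, channels, kernel_size=9):
        super().__init__()
        # A 2D convolution with a (kernel_size, 1) kernel slides along the time axis only,
        # so different keypoints are never mixed here; mixing keypoints is the job of the
        # spatial graph convolution above.
        self.conv = nn.Conv2d(channels, channels,
                              kernel_size=(kernel_size, 1),
                              padding=(kernel_size // 2, 0))

    def forward(self, x):                      # x: (batch, T, N, channels)
        x = x.permute(0, 3, 1, 2)              # -> (batch, channels, T, N)
        x = self.conv(x)
        return x.permute(0, 2, 3, 1)           # back to (batch, T, N, channels)
```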
Dimensional transformation is performed on the output of the graph neural network module to obtain a K-dimensional second vector Xk. As with the first vector, this transformation maps the skeleton features into the shared latent space, since the large spatial mismatch between the image features and the skeleton features would otherwise make the subsequent feature fusion difficult to learn.
S44, connecting the first vector and the second vector and inputting them into the fully connected layer to generate a third vector. The first vector Xc and the second vector Xk are connected and input into a fully connected layer with Q neurons, and a Q-dimensional third vector Xq is output.
S45, inputting the third vector into a classifier to obtain the predicted action category (a minimal sketch of S44 and S45 is given after the system description below).
According to this invention, the image feature carrying the environmental information is encoded into the model for the human action recognition based on the skeleton, so that the environmental information and the human skeletal information are utilized at the same time; thus robustness against changes in the environment is ensured and the environmental information is fully utilized.
Please refer to FIG. 4. The embodiment of this invention further provides a human action recognition system including: a video frame obtaining module 10, configured to obtain a video including human action, and resample and preprocess a video frame; an image feature extracting module 20, configured to extract an image feature of the video frame; a human keypoint sequence extracting module 30, configured to obtain a human keypoint sequence corresponding to the video frame based on human skeletal information; and an action category obtaining module 40, configured to input the image feature and the human keypoint sequence into a deep neural network to obtain an action category.
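Returning to steps S44 and S45 referenced above, the following sketch concatenates the K-dimensional vectors Xc and Xk, passes them through a fully connected layer with Q neurons to obtain Xq, and applies a classifier; the layer sizes, the number of action categories and the choice of a softmax classifier are assumptions, since the patent only states that a classifier is used.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """S44/S45: concatenate Xc and Xk, map to a Q-d vector Xq, then classify."""
    def __init__(self, k_dim=256, q_dim=256, num_classes=60):   # num_classes is hypothetical
        super().__init__()
        self.fuse = nn.Linear(2 * k_dim, q_dim)        # fully connected layer with Q neurons
        self.classifier = nn.Linear(q_dim, num_classes)

    def forward(self, x_c, x_k):                       # x_c, x_k: (batch, K)
        x_q = self.fuse(torch.cat([x_c, x_k], dim=-1))  # third vector Xq: (batch, Q)
        logits = self.classifier(x_q)
        return logits.softmax(dim=-1)                   # predicted probability per action category
```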
The embodiment of this invention further provides human action recognition equipment including: a memory, a processor, and a human action recognition program stored in the memory and run on the processor, wherein steps of the above human action recognition method are implemented when the human action recognition program is run by the processor.
The embodiment of this invention further provides a computer readable storage medium. A human action recognition program is stored in the computer readable storage medium, and steps of the above human action recognition method are implemented when the human action recognition program is run by a processor.
It should be noted that, in this description, the terms "including", "containing" or any variation thereof are intended to cover non-exclusive inclusion, so that a process, method, article or system including a series of elements not only includes the listed elements but also includes other elements not explicitly listed, or elements inherent to the process, method, article or system. In the absence of more restrictive conditions, an element limited by the phrase "including one" does not exclude the existence of additional identical elements in the process, method, article or system including that element.
The sequence numbers of the embodiments of this invention are merely for convenience of description and do not imply any preference among the embodiments.
From the description of the embodiments above, those skilled in the art will clearly understand that the embodiments may be implemented by software together with a necessary general hardware platform, or by hardware; in most situations, the former is preferred. On this basis, the part of the technical solution of this invention that contributes to the prior art may essentially be embodied as a software product. The computer software product is stored in a storage medium (for instance, ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions so that a terminal device (for instance, a phone, a computer, a server, an air conditioner, or a network device and the like) may execute the method described in each embodiment of this invention.
Although this invention has been described in considerable detail with reference to certain preferred embodiments thereof, the disclosure is not for limiting the scope of the invention. Persons having ordinary skill in the art may make various modifications and changes without departing from the scope and spirit of the invention. Therefore, the scope of the appended claims should not be limited to the description of the preferred embodiments described above.

Claims (10)

Claims
1. A human action recognition method, characterized by comprising the following steps: obtaining a video comprising human action, and resampling and preprocessing a video frame; extracting an image feature of the video frame; obtaining a human keypoint sequence corresponding to the video frame based on human skeletal information; and inputting the image feature and the human keypoint sequence into a graph convolutional neural network to obtain an action category.
2. The human action recognition method according to claim 1, characterized in that specific steps of obtaining the action category comprise: obtaining a first vector, wherein the first vector represents the image feature of the video frame; using the human keypoint sequence to construct a human keypoint graph; inputting the human keypoint graph into the graph convolutional neural network to generate a second vector; connecting the first vector and the second vector and inputting them into a fully connected layer to generate a third vector; and inputting the third vector into a classifier to obtain the predicted action category.
3. The human action recognition method according to claim 2, characterized in that specific steps of using the human keypoint sequence to construct the human keypoint graph comprise: recording the constructed human keypoint graph as G=(V,E), where V is a set of vertices of the graph, V = {Vti | t = 1,...,T, i = 1,...,N}, T is the number of skeletal sequences, N is the number of keypoints detected in one picture, and Vti is the i-th keypoint in the t-th picture; E represents edges of the graph, which is composed of two parts: a connection state E1 of keypoints in one frame and a connection state E2 of the keypoints in different frames, wherein E1 is a physical connection state of different keypoints in one frame and E2 is a virtual physical connection of the same keypoint in different frames defined for facilitating subsequent capture of a temporal feature; and in a process of implementation, using an N*N adjacency matrix A to describe the connection state of the keypoints in one frame, wherein Aij is 1 if there is a physical connection between a keypoint i and a keypoint j, otherwise it is 0.
4. The human action recognition method according to claim 2, characterized in that specific steps of generating the second vector comprise: stacking graph convolutional layers to form the graph convolutional neural network and performing the same operation in each graph convolutional layer; in each graph convolution layer, performing operation in two different dimensions, wherein one is to perform a graph convolution in a spatial dimension, and the other is to perform an ordinary convolution in a temporal dimension; and transforming an output of a graph neural network module to obtain the second vector.
5. The human action recognition method according to claim 4, characterized in that specific steps of performing the graph convolution in the spatial dimension comprise: in the spatial dimension, performing the graph convolution on each frame of the human keypoint graph to capture connection between different keypoints, with the specific implementation satisfying the following formula: Xout = D^(-1/2) (I + A) D^(-1/2) Xin W, wherein in the above formula, I is an identity matrix, A is an adjacency matrix, D is a degree matrix with Dii = Σj (Aij + Iij), and Xin is an input which is an N*U tensor; and W is a weight parameter of the graph convolution layer for transforming the feature.
6. The human action recognition method according to claim 4, characterized in that specific steps of performing the ordinary convolution in the temporal dimension comprise: in the temporal dimension, performing the ordinary convolution on the same keypoint between adjacent frames to capture change in each keypoint over time.
7. The human action recognition method according to claim 1, characterized in that specific steps of obtaining the first vector comprise: selecting a plurality of pictures from the video frames and respectively inputting the pictures into a ResNet-50 pre-trained on an Imagenet database, using an output of a last fully connected layer as the image feature to obtain a plurality of initial vectors, and averaging the initial vectors to obtain the first vector.
8. A human action recognition system characterized by comprising: a video frame obtaining module, configured to obtain a video comprising human action, and resample and preprocess a video frame; an image feature extracting module, configured to extract an image feature of the video frame; a human keypoint sequence extracting module, configured to obtain a human keypoint sequence corresponding to the video frame based on human skeletal information; and an action category obtaining module, configured to input the image feature and the human keypoint sequence into a deep neural network to obtain an action category.
9. Human action recognition equipment, characterized by comprising: a memory, a processor, and a human action recognition program stored in the memory and run on the processor, wherein steps of the human action recognition method according to any one of claims 1-7 are implemented when the human action recognition program is run by the processor.
10. A computer readable storage medium, characterized in that, a human action recognition program is stored in the computer readable storage medium, and steps of the human action recognition method according to any one of claims 1-7 are implemented when the human action recognition program is run by a processor.
LU101933A 2020-07-02 2020-07-02 Human action recognition method, human action recognition system and equipment LU101933B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
LU101933A LU101933B1 (en) 2020-07-02 2020-07-02 Human action recognition method, human action recognition system and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
LU101933A LU101933B1 (en) 2020-07-02 2020-07-02 Human action recognition method, human action recognition system and equipment

Publications (2)

Publication Number Publication Date
LU101933A1 LU101933A1 (en) 2021-05-12
LU101933B1 true LU101933B1 (en) 2021-06-08

Family

ID=75851275

Family Applications (1)

Application Number Title Priority Date Filing Date
LU101933A LU101933B1 (en) 2020-07-02 2020-07-02 Human action recognition method, human action recognition system and equipment

Country Status (1)

Country Link
LU (1) LU101933B1 (en)

Also Published As

Publication number Publication date
LU101933A1 (en) 2021-05-12

Similar Documents

Publication Publication Date Title
WO2022000420A1 (en) Human body action recognition method, human body action recognition system, and device
Doretto et al. Dynamic textures
US11232286B2 (en) Method and apparatus for generating face rotation image
US20210158023A1 (en) System and Method for Generating Image Landmarks
JP7439153B2 (en) Lifted semantic graph embedding for omnidirectional location recognition
CN111583399B (en) Image processing method, device, equipment, medium and electronic equipment
CN112085835B (en) Three-dimensional cartoon face generation method and device, electronic equipment and storage medium
WO2022104281A1 (en) A multi-resolution attention network for video action recognition
CN114549913B (en) Semantic segmentation method and device, computer equipment and storage medium
KR20190126857A (en) Detect and Represent Objects in Images
CN112132739A (en) 3D reconstruction and human face posture normalization method, device, storage medium and equipment
US20230326173A1 (en) Image processing method and apparatus, and computer-readable storage medium
CN112084849A (en) Image recognition method and device
CN111444957B (en) Image data processing method, device, computer equipment and storage medium
Lee et al. Background subtraction using the factored 3-way restricted Boltzmann machines
CN116012626B (en) Material matching method, device, equipment and storage medium for building elevation image
CN112417991A (en) Double-attention face alignment method based on hourglass capsule network
CN117314561A (en) Meta-universe product recommendation method and device and computer-readable storage medium
LU101933B1 (en) Human action recognition method, human action recognition system and equipment
WO2023178951A1 (en) Image analysis method and apparatus, model training method and apparatus, and device, medium and program
CN114972587A (en) Expression driving method and device, electronic equipment and readable storage medium
CN111709945B (en) Video copy detection method based on depth local features
CN114581485A (en) Target tracking method based on language modeling pattern twin network
CN111275183A (en) Visual task processing method and device and electronic system
Le et al. Locality and relative distance-aware non-local networks for hand-raising detection in classroom video

Legal Events

Date Code Title Description
FG Patent granted

Effective date: 20210608