CN113486706B - Online action recognition method based on human body posture estimation and historical information - Google Patents

Online action recognition method based on human body posture estimation and historical information

Info

Publication number
CN113486706B
CN113486706B (application CN202110558936.6A)
Authority
CN
China
Prior art keywords
action
model
training
skeleton
stgcn
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110558936.6A
Other languages
Chinese (zh)
Other versions
CN113486706A (en)
Inventor
冯伟
孙佳敏
边存灵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202110558936.6A priority Critical patent/CN113486706B/en
Publication of CN113486706A publication Critical patent/CN113486706A/en
Application granted granted Critical
Publication of CN113486706B publication Critical patent/CN113486706B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an online action recognition method based on human body posture estimation and historical information, comprising the following steps. Constructing and training an online action recognition model: for an input video, a skeleton sequence is extracted by a human body posture estimation algorithm, online action recognition is then performed, and the action category of the recognition result is given; this comprises collecting original motion video data, with the 3D skeleton data generated by the posture estimation algorithm used as the original data set; constructing a high-quality action recognition guidance module; constructing a low-quality robust action recognition module; and constructing an online action recognition module. Finally, the online action recognition model is tested.

Description

Online action recognition method based on human body posture estimation and historical information
Technical Field
The invention is mainly applied to the field of action recognition and relates to graph convolutional neural network technology, long short-term memory (LSTM) neural network technology and knowledge distillation technology in the field of artificial intelligence. The method can be used for human action recognition applications in the field of video processing.
Background
In recent years, with the rapid development of artificial intelligence, human action recognition has made great progress. It plays an increasingly important role in application scenarios such as intelligent security monitoring, human-computer interaction, education and intelligent medical treatment, has received the attention of numerous scholars and researchers, and has become an active research field.
The background art related to the invention is as follows:
(1) Human body posture estimation: human body posture estimation extracts the motion and action data of a human body in a video through a posture estimation algorithm. The extracted motion data are presented as a 3D skeleton sequence formed by connecting a number of human joint points, each joint point containing the spatial coordinate data of a human joint. A sequence of consecutive multi-frame 3D skeletons can simply and efficiently represent human motion characteristics, i.e. action information. Human posture estimation can therefore effectively help a model classifier perform high-precision action recognition.
(2) Knowledge distillation algorithm: knowledge distillation can transfer the action representation capability of one model to a target network model (here, the robust action recognition model). The robust action recognition model is then used to initialize the continuous action recognition model, so that online recognition of human actions in video can be realized and the online action recognition task is supported.
(3) STGCN network: the STGCN network performs well on human action recognition tasks [1]. It is a classic behavior recognition network model applied to the human skeleton, with strong generalization ability by design; by extracting and exploiting features of the skeleton sequence in both the spatial and temporal dimensions, it can improve the accuracy of human action recognition.
The related documents are:
[1] Yan S, Xiong Y, Lin D. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition [J]. 2018.
Disclosure of the Invention
The invention provides an online action recognition method. Skeleton coordinates are extracted for each frame of a video using a posture estimation algorithm, a skeleton space-time graph is constructed from these coordinates using a deep-learning spatio-temporal graph convolutional network model, the action representation capability of this model is transferred to a target network model using a knowledge distillation method, and finally a continuous action recognition prediction model is constructed. The robust action recognition model is used to initialize the continuous action model, so that online recognition of human actions in video can be realized and the online action recognition task is supported.
The invention adopts the following technical scheme:
an online action recognition method based on human body posture estimation and historical information comprises the following steps:
(1) Constructing and training an online action recognition model: for an input video, a skeleton sequence is extracted through a human body posture estimation algorithm, then online action recognition is realized, and the type of an action recognition result is given, wherein the method comprises the following steps:
a) Collecting original motion video data: obtaining 3D skeleton coordinate data through a human body posture estimation algorithm, and taking the 3D skeleton data generated by the posture estimation algorithm as an original data set;
b) Constructing a high-quality action recognition guidance module: the original 3D skeleton data extracted in step a) are processed to construct an accurate action segmentation training set V1, which is mainly used to train a teacher network model. An STGCN network is selected to construct the teacher network model, the teacher network is trained with a single-label training strategy, and a high-quality action representation guidance model is finally obtained.
c) Constructing a low-quality robust action recognition module: the 3D skeleton data of the original data set are processed to generate a preceding-action training set V2. An STGCN network is selected to build the student network model, which is trained on V2 with a single-label training strategy; a knowledge distillation method is used so that the high-quality action representation guidance model generated in step b) guides the training of the student network model, and a robust action recognition model is then obtained.
d) Constructing an online action recognition module: the original 3D skeleton data are processed to generate a new training set V3, a continuous action training set comprising continuous multi-segment action 3D skeleton sequences. An LSTM-STGCN is then built using an STGCN network and an LSTM network; in the concrete structure of the constructed action prediction model, the LSTM network is connected after the fully connected layer of the STGCN, and the output of the STGCN serves as the input of the LSTM. The LSTM-STGCN model is first initialized: the robust action recognition model obtained in step c) is used as the LSTM-STGCN skeleton feature extraction module, and the parameters of the robust action recognition model are loaded into the LSTM-STGCN. The model is then trained on the constructed continuous action training set, and the final target model, an online action recognition model, is obtained through a multi-label classification training strategy.
(2) Testing the online action recognition model: for an input online action video, a 3D skeleton data sequence of the human action is obtained through a human body posture estimation algorithm, the 3D skeleton sequence is fed into the online action recognition model, and the action category is output, completing the recognition of the online action.
Further, in step a), the 3D skeleton data is extracted using a Kinect v2 sensor pose estimation algorithm.
Further, in steps b), c) and d), a training set in tensor form is constructed from the 3D skeleton coordinate data with the data structure C × T × V × M, where C denotes 3 channels, T denotes the number of frames of data, V denotes the number of skeleton joint points, and M denotes the number of persons in the video. The constructed accurate action segmentation training set is segmented according to action category labels, so that the 3D skeleton data of each action sequence contains only one action category. The constructed preceding-action training set is obtained by changing the size of T so that a segment of the original 3D skeleton sequence containing the previous action is cut out; the number of frames of an action sequence in the preceding-action training set is T + B, where B is a set number, and the preceding-action training set in tensor form is generated according to this data format.
Furthermore, in step c), the high-quality action representation guidance model guides the training of the student model. This is realized in the construction of the loss function of the training model: the total loss function of the student network model is L_total = αL_student + βL_kl + γL_mse, where L_kl is the divergence loss function, L_student is the loss term of the student network's spatio-temporal graph convolutional network, and L_mse is the mean-square loss function used to compute the loss between the features extracted by the student network model and the features extracted by the teacher network model; α, β and γ are three set hyperparameters. The robust action recognition model is obtained by training on this data set with a single-label training strategy.
Further, in step d), a continuous action recognition model is constructed: an LSTM network is built onto the STGCN network to realize continuous sequence label classification, giving the LSTM-STGCN model. The model initializes the LSTM-STGCN skeleton feature extraction module with the robust action recognition model obtained in the previous training, models historical behavior information with the LSTM, and is trained on the constructed continuous action training set, finally obtaining the final target model: an online action recognition model.
Compared with the prior art, the method has the following advantages:
(1) The invention realizes online recognition of human actions in video. Previously, action recognition could only be performed after observing the complete action video. The advantage of this method is that it achieves online action recognition: actions can be classified and recognized without observing the complete action in the video, so the action can be judged in advance in practical application scenarios.
(2) The invention uses a knowledge distillation training method to improve the robustness and recognition accuracy of the network model. A teacher network model with strong action representation capability guides the training of the student network model, so that the student network model learns good action representation capability, improving both recognition accuracy and model robustness.
(3) The online action recognition model of the invention combines historical action information in recognition, realizing recognition and prediction for every frame of the video.
Drawings
FIG. 1: flow chart of online action recognition method based on human body posture estimation and historical information
Detailed Description
The invention provides an online action recognition method based on human body posture estimation and historical information. It differs from existing methods in that online action recognition is realized using a knowledge distillation technique and an LSTM sequence classification technique, while the robustness and generalization ability of the network are improved. The technical scheme of the invention is described clearly and completely below with reference to the accompanying drawings. The technical method and the beneficial progress of the invention are all within the protection scope of the invention.
1. Constructing and training an online action recognition model:
As shown in FIG. 1, for an input video, the skeleton sequence of the human motion in the video is first extracted and then constructed into training sets that can be used for model training. Different models are then built, and a knowledge distillation method and different training strategies are used, thereby obtaining the target model required by the invention: an online action recognition model.
1) And (3) construction of a training set:
the method is mainly divided into three types, namely, a precise action segmentation training set V for a teacher network model is constructed 1 And a preceding text action training set V for the student network model 2 Continuous action training set V for training on-line action recognition model 3 . Wherein V 1 The method is characterized in that each action in a collected video data set is accurately divided, namely each piece of video action skeleton data and an action label are single and do not contain data of other actions. V for training student network model 2 It means that each action segment of the training set includes not only the current action but also the last action segment before the action starts and the 3D skeleton data of the current action. V 3 The training set is a 3D skeleton data sequence comprising a plurality of action sequences and is not a single segmented action sequence.
2) Data structure of training set:
the data structure of the training set is set as: c is multiplied by T by V and M is multiplied by M, wherein C represents 3 channels, T represents the frame number of data, V represents the number of skeleton joint points, 25 3D skeleton joint points are selected, and M represents the number of people in a video. And constructing an accurate action segmentation training set, segmenting according to the action class labels, wherein the 3D skeleton data of each action sequence only comprises one action class. The constructed preamble action training set is used for intercepting a section of 3D framework sequence containing the previous action in the original 3D framework sequence by changing the size of T, and then generating a tensor preamble action training set according to the data structure set by the invention. The constructed continuous motion data set generates a complete 3D skeleton sequence training set by all motion sequences in the whole video.
3) A high-quality action recognition guidance module:
a teacher network model needs to be built in the high-quality action recognition guidance module, and the teacher network model is built by selecting an STGCN network.
STGCN is formally expressed as follows:
F_out = Λ^(−1/2) (A + I) Λ^(−1/2) F_in W    (1)
where F_out denotes the model output, A the adjacency matrix, I the identity matrix, F_in the input, and W the weight matrix, with Λ^(ii) = Σ_j (A^(ij) + I^(ij)).
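A minimal PyTorch sketch of the normalized graph convolution in equation (1); the single-adjacency simplification and the layer interface are assumptions for illustration, not the full multi-partition STGCN:

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """One spatial graph-convolution step: F_out = Λ^-1/2 (A + I) Λ^-1/2 F_in W."""
    def __init__(self, in_channels, out_channels, adjacency):
        super().__init__()
        A_hat = adjacency + torch.eye(adjacency.size(0))      # A + I
        deg = A_hat.sum(dim=1)                                 # Λ_ii = Σ_j (A_ij + I_ij)
        d_inv_sqrt = torch.diag(deg.pow(-0.5))
        self.register_buffer("norm_adj", d_inv_sqrt @ A_hat @ d_inv_sqrt)
        self.weight = nn.Linear(in_channels, out_channels, bias=False)   # W

    def forward(self, x):
        # x: (batch, T, V, in_channels) -> aggregate over joints, then mix channels
        x = torch.einsum("vw,btwc->btvc", self.norm_adj, x)
        return self.weight(x)
```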
4) A low-quality robust motion recognition module:
the low-quality robust action recognition module needs to construct a student network model, and the STGCN network is also selected for the construction of the student network model. The low-quality robust motion recognition module applies a knowledge distillation method, and the knowledge distillation method is described in detail in a loss function of model training.
5) An online action identification module:
the line action identification module needs to construct an LSTM-STGCN model and is mainly constructed by using an STGCN network and an LSTM network. With particular regard to the LSTM-STGCN formalized expression, the following:
f t =δ(W·[h t-1 ,F t ]+b) (2)
where δ represents the sigmod function, f t For the output of LSTM-STGCN at the t-th frame, h t-1 Output label, F for t-1 frame t For the output label of the STGCN network at the t frame, W is the weight and b is the bias parameter.
6) The model training method comprises the following steps:
the invention mainly relates to the training of three models, namely the training of a teacher network model, the training of a student network model and the training of an online action recognition model. Each model was trained using a different strategy and three different training sets constructed. And the teacher network model is trained by using a training strategy of a single label and a training set of accurate segmentation. And (3) training the student network model, wherein the teacher network model is used for guiding the training of the student network model by using a knowledge distillation method. The training strategy is a single label training strategy, the training set uses a preceding action training set, and then a robust action recognition model is obtained. The training of the linear motion recognition model is guided by using a robust motion recognition model, and the training strategy is a sequence label classification method. The constructed continuous motion training set is used for training to obtain a continuous motion recognition model, and the model is a final model required by the invention and can realize online video human motion recognition and recognition.
7) Teacher network model loss function:
teacher network model training oss
L teacher =L crossentroy (P teacher ,Q) (3)
Wherein L is crossentroy As a cross-entropy loss function, P teacher And Q is label of the output of the teacher network model.
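A minimal sketch of the teacher's single-label training step with the cross-entropy loss of equation (3); the `teacher` module and the `loader` yielding (skeleton tensor, label) batches are placeholder names, not part of the patent:

```python
import torch
import torch.nn.functional as F

def train_teacher_epoch(teacher, loader, optimizer, device="cuda"):
    teacher.train()
    for skeletons, labels in loader:             # skeletons: (N, C, T, V, M), labels: (N,)
        skeletons, labels = skeletons.to(device), labels.to(device)
        logits = teacher(skeletons)              # P_teacher, shape (N, num_classes)
        loss = F.cross_entropy(logits, labels)   # L_teacher = L_crossentropy(P_teacher, Q)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```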
8) Student network model loss function:
and (4) training a student network model, and selecting a single label classification method for a training strategy. Different from the teacher network model, the knowledge distillation method is added for the training of the student network model, and simply speaking, the reasoning ability of the teacher network is transferred to the student network model. The specific implementation is mainly characterized in that the reasoning ability of the teacher model is transferred to the student model through the improved loss function. Entire student network
The total loss of the model is as follows:
L total =αL student +βL kl +γL mse (4)
L student alpha, beta and gamma are three hyper-parameters set by experiments for loss terms of a student network model space-time graph convolutional network, the numerical values are mainly adjusted according to the effects in the experiments, and the initial values are all 1.
L student =L crossentroy (P student ,Q) (5)
Wherein L is crossentroy As a cross-entropy loss function, P student Q is label for the output of the student network model.
L kl To measure the loss terms of the teacher network model output and the student network model output:
L kl =D(l softmax (P student ),l softmax (P teacher) )) (6)
wherein l softmax (P student ) Is the probability distribution of the student network model output by the softmax function, l softmax (P teacher) ) The probability distribution of the teacher network model output through the softmax function is shown, and D is the KL divergence function.
L mse For measuring the extraction characteristics of the teacher network model and the extraction characteristic loss items of the student network model:
L mse =l mse (P student ,P teacher ) (7)
l mse is a mean square loss function used for calculating the loss of the square of the extracted features of the student network model and the square of the extracted features of the teacher network model. P student Output representing extracted features of a student network model, P teacher And (4) output representing the extracted features of the teacher network model.
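A minimal PyTorch sketch of the combined distillation loss of equations (4)–(7); the equal default weights follow the text, while the way the intermediate features are obtained (here passed in as arguments) is an assumption:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_feat, teacher_feat, labels,
                      alpha=1.0, beta=1.0, gamma=1.0):
    # L_student: cross-entropy between student output and ground-truth label (eq. 5)
    l_student = F.cross_entropy(student_logits, labels)
    # L_kl: KL divergence between the softmax outputs of student and teacher (eq. 6)
    l_kl = F.kl_div(F.log_softmax(student_logits, dim=1),
                    F.softmax(teacher_logits, dim=1),
                    reduction="batchmean")
    # L_mse: mean-square loss between student and teacher extracted features (eq. 7)
    l_mse = F.mse_loss(student_feat, teacher_feat)
    # L_total = α·L_student + β·L_kl + γ·L_mse (eq. 4)
    return alpha * l_student + beta * l_kl + gamma * l_mse
```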
9) The LSTM-STGCN model is realized by the following steps:
the LSTM-STGCN model needs to be realized by fully utilizing historical information to realize continuous action identification and realize identification of each frame, so that the model is built by using the graph space-time convolution network and the LSTM network. The LSTM network is accessed to the full connection layer of the space-time graph convolutional network, the output of the STGCN is used as the input of the LSTM, and the LSTM network capable of identifying the action sequence is designed in the input and output module, so that the online action identification and recognition are realized.
The formalized expression of LSTM-STGCN is as follows:
f t =δ(W·[h t-1 ,F t ]+b) (8)
where δ represents the sigmod function, f t For the output of LSTM-STGCN at the t-th frame, h t-1 Output label, F for t-1 frame t For the output label of the STGCN network at the t frame, W is the weight and b is the bias parameter.
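A minimal sketch of attaching an LSTM after the STGCN's fully connected output so that each step's prediction combines F_t with the history h_(t−1); the `stgcn_backbone` module, hidden size and per-window batching are assumptions for illustration, not the patent's exact wiring:

```python
import torch
import torch.nn as nn

class LSTMSTGCN(nn.Module):
    """STGCN backbone followed by an LSTM over its per-step outputs, for online recognition."""
    def __init__(self, stgcn_backbone, num_classes, hidden=128):
        super().__init__()
        self.backbone = stgcn_backbone          # initialized from the robust action recognition model
        self.lstm = nn.LSTM(num_classes, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, skeleton_windows):
        # skeleton_windows: (N, steps, C, T, V, M) -- one STGCN input per time step
        n, steps = skeleton_windows.shape[:2]
        per_step = [self.backbone(skeleton_windows[:, t]) for t in range(steps)]  # F_t per step
        seq = torch.stack(per_step, dim=1)      # (N, steps, num_classes)
        out, _ = self.lstm(seq)                 # combines F_t with the history h_(t-1)
        return self.head(out)                   # per-step class scores f_t
```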
10) LSTM-STGCN model loss function:
The LSTM-STGCN model training loss is:
L_LSTM-STGCN = L_crossentropy(P, Q)    (9)
where L_LSTM-STGCN denotes the loss term of the model, L_crossentropy is the cross-entropy loss function, P is the output of the current model, and Q is the label.
2. Experimental setup:
and (3) setting specific parameters of an experiment, and realizing motion recognition of the video sequence by sliding a window on the video sequence in the recognition process of the three constructed models. The larger the sliding window is, the larger the number of frames of the model extracted from the training data in the identification process is, for example, when the sliding window is set to 50, the larger the number of frames of the model extracted from the training data in the identification process is, the larger the number of frames of the model extracted from a data sample in the identification process is, the larger the sliding window is, the larger the number of frames of the model extracted from the training data in the identification process is, and the size of the sliding window becomes an important influence factor in the experimental part of the present invention. Regarding the size of the sliding window, the present invention mainly sets 4 values, 50, 100, 150, and 200.
3. Model test and result evaluation:
and (4) evaluating the results: the experimental part of the invention adopts two indexes for evaluating the experimental result: accuracy and average accuracy. In the experiment of the high-quality action representation guidance model and the experiment of the low-quality robust action recognition model, the evaluation index of the experiment result is the accuracy, and the experiment accuracy of the method is top1 precision. top1 precision mainly refers to probability output of an identified object, and if the maximum probability value in the output is a correct label, prediction is successful. The method comprises the steps of identifying each frame of a video during testing of an action identification model to obtain an identification result of each frame, then adding the identification accuracy of all the frames to divide the total number of the identified frames, and finally obtaining the average accuracy of the identification of the whole video. In the comparison of the LSTM-STGCN ablation experiment and the experiment of other existing methods, the average precision is adopted in the experiment to evaluate the result, the average precision can better evaluate the recognition and prediction capabilities of the model for the online actions, when the model only observes incomplete actions, namely a part of the actions, the model can give the accuracy to each frame of the video, and the recognition capabilities of the model can be objectively evaluated.
The final experimental results are given in the following tables:
Table 1 (reproduced as an image in the original publication): experimental results of the knowledge-distillation-based robust motion feature extraction module
Table 2 (reproduced as an image in the original publication): LSTM-STGCN ablation experiment results
Model              ST-LSTM   FSNet    SSNet    LSTM-STGCN
Average accuracy   53.46%    53.96%   59.03%   62.37%
Table 3: results of the present invention and other existing methods
The online action recognition provided by the invention mainly refers to recognizing human actions in a video input sequence. Unlike traditional action recognition, online action recognition has the advantage of judging human actions in advance: the model recognizes the action before the complete action has been observed. This can help decision makers with analysis, early warning and so on, which has important practical application value and significance; for example, in the field of intelligent security monitoring, online action recognition can predict actions in advance.

Claims (3)

1. An online action recognition method based on human body posture estimation and historical information comprises the following steps:
(1) Constructing and training an online action recognition model: for an input video, a skeleton sequence is extracted through a human body posture estimation algorithm, then online action recognition is realized, and the type of an action recognition result is given, wherein the method comprises the following steps:
a) Collecting original motion video data: obtaining 3D skeleton coordinate data through a human body posture estimation algorithm, and taking the 3D skeleton data generated by the posture estimation algorithm as an original data set;
b) Constructing a high-quality action recognition guidance module: processing the original 3D skeleton data extracted in step a) to construct an accurate action segmentation training set V1 for the teacher network model; an STGCN network is selected for construction of the teacher network model, the teacher network is trained with a single-label training strategy, and a high-quality action representation guidance model is finally obtained;
c) Constructing a low-quality robust action recognition module: processing the 3D skeleton data of the original data set to generate a preceding-action training set V2 for the student network model; each action segment of the training set V2 includes not only the current action but also the 3D skeleton data of the last action segment before the current action starts; an STGCN network is selected to build the student network model, which is trained on the training set V2 with a single-label training strategy; a knowledge distillation method is used, whereby the high-quality action representation guidance model generated in step b) guides the training of the student network model, and a robust action recognition model is then obtained; the high-quality action representation guidance model guides the training of the student model as follows: in the construction of the loss function of the training model, the total loss function of the student network model is L_total = αL_student + βL_kl + γL_mse, where L_kl is the divergence loss function, L_student is the loss term of the student network's spatio-temporal graph convolutional network, L_mse is the mean-square loss function used to compute the loss between the features extracted by the student network model and the features extracted by the teacher network model, and α, β and γ are three set hyperparameters; the robust action recognition model is obtained by training on the aforementioned data set with a single-label training strategy;
d) Constructing an online action recognition module: processing the original 3D skeleton data to generate a new continuous action training set V3 for training the online action recognition model; the training set V3 is not a segmented single action sequence but comprises continuous multi-segment action 3D skeleton sequences; an LSTM-STGCN is then built using the STGCN and LSTM networks, the LSTM network being built onto the STGCN to realize continuous sequence label classification: the LSTM network is connected after the fully connected layer of the STGCN and the output of the STGCN serves as the input of the LSTM, yielding a continuous action recognition model, namely the LSTM-STGCN model; the LSTM-STGCN model is initialized by taking the robust action recognition model obtained in step c) as the LSTM-STGCN skeleton feature extraction module and loading the parameters of the robust action recognition model into the LSTM-STGCN; the model is trained with the constructed continuous action training set, and the final target model, an online action recognition model, is obtained through a multi-label classification training strategy;
(2) Testing the online action recognition model: for an input online action video, a 3D skeleton data sequence of the human action is obtained through a human body posture estimation algorithm, the 3D skeleton sequence is fed into the online action recognition model, and the action category is output, completing the recognition of the online action.
2. The online action recognition method as claimed in claim 1, wherein in step a), the 3D skeleton data is extracted using a Kinect v2 sensor pose estimation algorithm.
3. The online action recognition method according to claim 1, wherein in steps b), c) and d), a training set in tensor form is constructed from the 3D skeleton coordinate data with the data structure C × T × V × M, where C denotes 3 channels, T denotes the number of frames of data, V denotes the number of skeleton joint points, and M denotes the number of persons in the video; the constructed accurate action segmentation training set V1 is segmented according to action category labels, and the 3D skeleton data of each action sequence contains only one action category; the constructed preceding-action training set V2 is obtained by changing the size of T so that a 3D skeleton sequence containing the previous action is cut out of the original 3D skeleton sequence; the number of frames of an action sequence in the preceding-action training set V2 is T + B, where B is a set number, and the preceding-action training set V2 in tensor form is generated according to this data format.
CN202110558936.6A 2021-05-21 2021-05-21 Online action recognition method based on human body posture estimation and historical information Active CN113486706B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110558936.6A CN113486706B (en) 2021-05-21 2021-05-21 Online action recognition method based on human body posture estimation and historical information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110558936.6A CN113486706B (en) 2021-05-21 2021-05-21 Online action recognition method based on human body posture estimation and historical information

Publications (2)

Publication Number Publication Date
CN113486706A CN113486706A (en) 2021-10-08
CN113486706B true CN113486706B (en) 2022-11-15

Family

ID=77932972

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110558936.6A Active CN113486706B (en) 2021-05-21 2021-05-21 Online action recognition method based on human body posture estimation and historical information

Country Status (1)

Country Link
CN (1) CN113486706B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160277A (en) * 2019-12-31 2020-05-15 深圳中兴网信科技有限公司 Behavior recognition analysis method and system, and computer-readable storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108133188B (en) * 2017-12-22 2021-12-21 武汉理工大学 Behavior identification method based on motion history image and convolutional neural network
CN110503077B (en) * 2019-08-29 2022-03-11 郑州大学 Real-time human body action analysis method based on vision
CN111444879A (en) * 2020-04-10 2020-07-24 广东工业大学 Joint strain autonomous rehabilitation action recognition method and system
CN111582095B (en) * 2020-04-27 2022-02-01 西安交通大学 Light-weight rapid detection method for abnormal behaviors of pedestrians
CN111814719B (en) * 2020-07-17 2024-02-20 江南大学 Skeleton behavior recognition method based on 3D space-time diagram convolution
CN112036379A (en) * 2020-11-03 2020-12-04 成都考拉悠然科技有限公司 Skeleton action identification method based on attention time pooling graph convolution
CN112597883B (en) * 2020-12-22 2024-02-09 武汉大学 Human skeleton action recognition method based on generalized graph convolution and reinforcement learning

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160277A (en) * 2019-12-31 2020-05-15 深圳中兴网信科技有限公司 Behavior recognition analysis method and system, and computer-readable storage medium

Also Published As

Publication number Publication date
CN113486706A (en) 2021-10-08

Similar Documents

Publication Publication Date Title
CN110532900B (en) Facial expression recognition method based on U-Net and LS-CNN
CN113378632B (en) Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method
CN107609009B (en) Text emotion analysis method and device, storage medium and computer equipment
CN109492099B (en) Cross-domain text emotion classification method based on domain impedance self-adaption
CN108681752B (en) Image scene labeling method based on deep learning
CN113496217B (en) Method for identifying human face micro expression in video image sequence
CN110046656B (en) Multi-mode scene recognition method based on deep learning
CN111126488B (en) Dual-attention-based image recognition method
CN109101938B (en) Multi-label age estimation method based on convolutional neural network
CN107229914B (en) Handwritten digit recognition method based on deep Q learning strategy
CN110046671A (en) A kind of file classification method based on capsule network
CN107506722A (en) One kind is based on depth sparse convolution neutral net face emotion identification method
CN113065460B (en) Establishment method of pig face facial expression recognition framework based on multitask cascade
CN112949740B (en) Small sample image classification method based on multilevel measurement
CN111783879B (en) Hierarchical compressed graph matching method and system based on orthogonal attention mechanism
CN109508686B (en) Human behavior recognition method based on hierarchical feature subspace learning
CN111401105B (en) Video expression recognition method, device and equipment
CN113297936B (en) Volleyball group behavior identification method based on local graph convolution network
CN111028319A (en) Three-dimensional non-photorealistic expression generation method based on facial motion unit
CN112529063B (en) Depth domain adaptive classification method suitable for Parkinson voice data set
CN115966010A (en) Expression recognition method based on attention and multi-scale feature fusion
CN107341471A (en) A kind of Human bodys' response method based on Bilayer condition random field
CN113255543B (en) Facial expression recognition method based on graph convolution network
CN113408418A (en) Calligraphy font and character content synchronous identification method and system
CN117765432A (en) Motion boundary prediction-based middle school physical and chemical life experiment motion detection method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant