CN112633220B - Human body posture estimation method based on bidirectional serialization modeling - Google Patents

Publication number: CN112633220B (granted); earlier published as CN112633220A
Application number: CN202011610311.1A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 刘振广, 封润洋, 陈豪明, 王勋, 钱鹏
Assignee (original and current): Zhejiang Gongshang University
Legal status: Active

Classifications

    • G06V20/40 — Scenes; scene-specific elements in video content
    • G06F18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N3/045 — Neural networks; combinations of networks
    • G06N3/08 — Neural networks; learning methods
    • G06V10/25 — Image preprocessing; determination of region of interest [ROI] or volume of interest [VOI]
    • G06V40/10 — Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • Y02T10/40 — Engine management systems


Abstract

The invention discloses a human body posture estimation method based on bidirectional serialization modeling. The method takes 3 consecutive frames as input and makes full use of the temporal information in the video: it first computes the approximate spatial range of each joint and then regresses the precise joint position within that smaller range. This better handles the occlusion and motion blur inherent in the human posture estimation task, giving the model stronger generalization and higher accuracy. By fully exploiting the temporal information of video, the invention strengthens the reasoning ability of the model and estimates key human body parts more reliably, which is of practical importance for industries such as security and short-video platforms that need to extract poses in real time for analysis.

Description

Human body posture estimation method based on bidirectional serialization modeling
Technical Field
The invention belongs to the technical field of human body posture estimation, and particularly relates to a human body posture estimation method based on bidirectional serialization modeling.
Background
Human body posture estimation is a frontier research field in computer vision that aims to locate key parts of the human body (such as wrists and ankles) in a picture or video. Posture estimation serves as a bridge for communication between machines and humans and has great practical significance; it has been widely applied in many fields. In stage animation, recognizing human poses and actions enables real-time interactive animation effects; in autonomous driving, predicting the movement trend of pedestrians can help avoid traffic accidents in advance; in security, abnormal behaviors can be detected by recognizing specific pose sequences.
Current human body posture estimation methods fall into two main categories. (1) Top-down: first detect the positions of all human bodies in the picture and mark each person with a rectangular bounding box; then identify each person's joints with a joint position detector; finally map the pose information of the cropped person back into the original picture by affine transformation, thereby estimating the poses of all people in the picture. Top-down methods separate the person detection task from the joint detection task and can concentrate on pose estimation, so they achieve higher accuracy; however, their detection time grows with the number of people in the picture, and because they rely on object detection technology, the quality of the detected position coordinates directly affects the final pose estimation result. (2) Bottom-up: first detect the joint position information of all human bodies in the picture, then cluster the joint coordinates belonging to the same person, thereby estimating the poses of all people in the picture. Bottom-up methods are efficient and their detection time is less affected by the number of people, but their accuracy lags slightly behind.
Mainstream posture estimation methods, whether top-down or bottom-up, use network architectures designed for static pictures and excel at estimating poses in single frames; when applied to video, they usually decompose the video into single frames and estimate each frame separately. These methods have a serious limitation: they can only capture the appearance information of a single picture. One frame, i.e., 1/25 second, is very short, so the images of two neighboring video frames cannot change greatly and are highly similar; this rich geometric consistency between adjacent frames provides extra cues that can be used to correct keypoints that are hard to predict, such as those affected by occlusion or motion blur.
Conventional picture-based pose estimation methods cannot effectively exploit this additional information, and therefore cannot handle the heavy person entanglement, mutual occlusion and motion blur that frequently occur in video sequences, so they struggle to obtain good results in video pose estimation. To address this problem, the document [Flowing ConvNets for Human Pose Estimation in Videos - Pfister, T., Charles, J. & Zisserman, A. (ICCV 2015)] proposes computing dense optical flow between every two frames and then using the flow-based temporal information to correct the initial pose estimates. When the optical flow can be computed correctly, this method works well; however, optical flow computation is strongly affected by picture quality, occlusion and the like, not all optical flow information in a video can be computed accurately, and computing it often requires a great deal of compute. Other researchers have proposed modeling the video directly with Long Short-Term Memory (LSTM) networks to capture temporal information; however, owing to the architectural limitations of LSTM networks, this approach only works well when the people in the video frames are sparse, and in complex scenes it still cannot handle occlusion, motion blur and similar situations.
Disclosure of Invention
In view of the above, the invention provides a human body posture estimation method based on bidirectional serialization modeling. The method takes 3 consecutive frames as input, makes full use of the temporal information in the video to compute the approximate spatial range of each joint, and then regresses the precise joint position within that smaller range, thereby better handling the occlusion and motion blur inherent in the human posture estimation task and giving the model stronger generalization and higher accuracy.
A human body posture estimation method based on bidirectional serialization modeling comprises the following steps:
(1) Collecting and preprocessing a video data set for human body posture estimation;
(2) For a complete video in the video data set, taking continuous 3 frames of video images as a group of samples, and manually marking coordinates of each key part of a human body in the video images;
(3) Constructing a bidirectional continuous convolutional neural network, and training the convolutional neural network by using a large number of samples to obtain a human body posture estimation model;
(4) Input the 3 consecutive frames of video images to be estimated into the human body posture estimation model, and output the posture estimation result for the person in the 2nd frame video image, i.e., the coordinates of each key part of the human body.
Further, in step (1), for each frame of video image in the video dataset, the position coordinates of the human ROI (region of interest, i.e., the person's bounding box) in the image are detected by the YOLOv5 algorithm, and the ROI is then enlarged by 25%.
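As a minimal sketch of the ROI enlargement step, the box can be expanded by 25% about its center; the function name and the center-anchored convention are illustrative assumptions, since the text only states that the ROI is enlarged by 25%:

```python
def enlarge_roi(x, y, w, h, scale=1.25):
    """Enlarge a person bounding box about its center by the given factor.

    (x, y) is the top-left corner; w and h are width and height.
    scale=1.25 corresponds to the 25% enlargement described in the text.
    """
    cx, cy = x + w / 2.0, y + h / 2.0      # box center
    new_w, new_h = w * scale, h * scale    # enlarged extent
    return cx - new_w / 2.0, cy - new_h / 2.0, new_w, new_h
```

In practice the enlarged box would also be clipped to the image bounds before cropping the three frames.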
Further, the bidirectional continuity convolutional neural network consists of a Backbone network, a posture time merging network, a posture residual fusion network and a posture correction network. The Backbone network preliminarily computes the human posture feature vectors h_{i-1}, h_i, h_{i+1} for the three frames of video images of an input sample; the three feature vectors are stacked to obtain a vector φ(h), which is input into the posture time merging network and the posture residual fusion network respectively. The posture time merging network encodes the approximate spatial range of each human joint to obtain a feature vector ξ(h); the posture residual fusion network computes the human posture residual vector ψ(h); then ξ(h) and ψ(h) are stacked into the feature vector η, which is input into the posture correction network to compute the human posture prediction result.
Further, the posture time merging network is formed by stacking three Residual Blocks; the vector φ(h) is regrouped in joint order and then used as the input of this network, which outputs the feature vector ξ(h). The posture residual fusion network is formed by stacking five residual blocks; the posture feature differences between the second and first frames and between the second and third frames of a sample are computed and concatenated with weights to obtain the tensor ζ, which is used as the input of this network, and the posture residual vector ψ(h) is output. The tensor ζ is given by:

ζ = concat( w₁ · (h_i − h_{i−1}), w₂ · (h_i − h_{i+1}) )

where w₁ and w₂ are the concatenation weights.
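The weighted difference-and-concatenate step that forms ζ can be sketched in a few lines of numpy; the function name, the equal default weights, and concatenation along the channel axis are illustrative assumptions, since the text specifies only "differences cascaded with weights":

```python
import numpy as np

def pose_residual_input(h_prev, h_cur, h_next, w1=0.5, w2=0.5):
    """Build the tensor zeta fed to the posture residual fusion network:
    the weighted concatenation of (frame2 - frame1) and (frame2 - frame3)
    posture feature differences."""
    d1 = h_cur - h_prev   # second-frame minus first-frame features
    d2 = h_cur - h_next   # second-frame minus third-frame features
    return np.concatenate([w1 * d1, w2 * d2], axis=0)
```

With 17-channel per-frame posture features, ζ has twice the channel count of a single frame's features.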
further, the residual block is formed by sequentially connecting a convolution layer with the size of 3×3, a batch normalization layer and a Relu activation layer, the residual block in the gesture time merging network adopts packet convolution, and the number of packets=17 (according to the key point standard of the COCO data set, 17 key points are all adopted); the residual blocks in the pose residual fusion network do not use packet convolution, the number of packets=1.
Further, the posture correction network consists of five parallel deformable convolutions with dilation rates 3, 6, 9, 12 and 15 respectively. Each deformable convolution takes the feature vector η, obtained by stacking ξ(h) and ψ(h), as input and outputs a predicted Gaussian heat map; the five Gaussian heat maps output by the five convolutions are averaged to obtain the human posture prediction result.
Further, the training process of the bidirectional continuity convolutional neural network in step (3) is divided into two stages: first train the Backbone network; then fix the Backbone network parameters and train the posture time merging network, posture residual fusion network and posture correction network.
Further, the specific process of training the Backbone network is as follows: input the human ROIs in all video images of a sample into the Backbone network one by one, compute the loss function L1 between the human posture prediction output by the whole bidirectional continuity convolutional neural network and the manual annotation corresponding to the sample, and repeatedly update the Backbone network parameters by back-propagation according to L1 until L1 converges. The loss function L1 is:

L1 = Σ_{i=1}^{N} v_i · ‖H_pred_i − H_gt_i‖₂
wherein: n is the number of marked key parts of the human body, H gt_i Converting coordinates of the ith key part artificial mark of all human ROIs in a group of samples to generate a Gaussian heat map superimposed result, H pred_i Converting coordinates of ith key part of all human ROIs in a group of samples, which are predicted and output by a bidirectional continuous convolutional neural network, to generate a result obtained by overlapping Gaussian heat maps,‖ ‖ 2 Represents the L2 norm, v i Indicating whether the ith key part has a label in the sample image, if so, the value is 1, otherwise, the value is 0.
Further, the specific process of training the posture time merging network, the posture residual fusion network and the posture correction network is as follows: first fix the trained Backbone network parameters; then input the human ROIs in all video images of a sample into the Backbone network one by one, compute the loss function L2 between the human posture prediction output by the whole bidirectional continuity convolutional neural network and the manual annotation corresponding to the sample, and repeatedly update the parameters of the posture time merging network, posture residual fusion network and posture correction network by back-propagation according to L2 until L2 converges. The loss function L2 is:

L2 = Σ_{i=1}^{N} v_i · ‖G_pred_i − G_gt_i‖₂
wherein: n is the number of marked key parts of the human body, G gt_i A Gaussian heat map, G, generated by converting coordinates of an ith key part artificial mark of a human body ROI in a 2 nd frame video image of a group of samples pred_i The method comprises the steps that a Gaussian heat map generated by converting the output coordinates of the ith key part of a human body ROI in a 2 nd frame video image of a group of samples through bidirectional continuous convolutional neural network prediction is II 2 Represents the L2 norm, v i Indicating whether the ith key part has a label in the sample image, if so, the value is 1, otherwise, the value is 0.
Further, the specific implementation of step (4) is as follows: input the human ROIs of the same person in the 3 consecutive frames to be estimated into the human posture estimation model and output a Gaussian heat map; convert the Gaussian heat map to obtain the key-part coordinate information of that person in the 2nd frame video image; map the coordinate information into the 2nd frame video image and link the key parts in order to generate the predicted human skeleton, thereby achieving human posture estimation.
The human body posture estimation method based on bidirectional continuity mainly uses deformable convolution networks with different dilation rates as the prediction model. A deformable convolution network is a variant of the conventional convolutional neural network: conventional convolution kernels are square, while common objects such as human bodies are not, so conventional convolution networks have certain limitations; a deformable convolution network learns an offset parameter for each pixel of the kernel and can thus form kernels of arbitrary shape that better fit objects of various shapes. Each convolution layer adopts a different dilation rate, corresponding to a different receptive field: the larger the dilation rate, the larger the receptive field and the more global the captured information; conversely, a smaller dilation rate captures finer local information. This design of the deformable convolution network is therefore well suited to human body pose estimation in video.
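The relationship between dilation rate and receptive field can be made concrete with the standard formula for the effective extent of a dilated kernel, d·(k−1)+1, applied to the five dilation rates used above (the function name is illustrative):

```python
def effective_kernel(k=3, dilation=1):
    """Effective spatial extent of a k x k convolution kernel with the
    given dilation rate: dilation * (k - 1) + 1 pixels per side."""
    return dilation * (k - 1) + 1

# Effective extents for the five parallel 3x3 deformable convolutions
# with dilation rates 3, 6, 9, 12, 15 described in the text.
spans = {d: effective_kernel(3, d) for d in (3, 6, 9, 12, 15)}
```

The spread from a 7-pixel to a 31-pixel extent is what lets the five branches capture local detail and global context in parallel.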
The invention makes full use of the temporal information in video, strengthens the reasoning ability of the model, and estimates key human body parts more reliably; it is of practical importance for industries such as security and short-video platforms that need to extract poses in real time for analysis. Its two main advantages are:
1. Through an accurate pose estimation algorithm, the invention can better estimate occluded and motion-blurred keypoints, and detection is both more accurate and faster.
2. Being designed for video, the method better fits a variety of application scenarios; meanwhile, by adopting grouped convolution, dilated convolution and the like, it achieves better results with fewer parameters, so pose estimation can be applied in real time.
Drawings
Fig. 1 is a flow chart of a human body posture estimation method of the present invention.
FIG. 2 is a schematic diagram of the Residual Block structure and its stacking scheme.
FIG. 3 is a schematic diagram of a two-way continuous convolutional neural network of the present invention.
Detailed Description
In order to describe the present invention more specifically, the technical scheme of the invention is described in detail below with reference to the accompanying drawings and specific embodiments.
As shown in fig. 1, the human body posture estimation method based on bidirectional continuity of the invention comprises the following steps:
(1) And collecting and selecting a human body posture estimation video data set, and preprocessing the data set.
In this embodiment, the training data uses the PoseTrack dataset, which is designed for the human pose tracking task; many of its videos contain person occlusion and motion blur, which greatly increases the difficulty of video human pose estimation. This embodiment follows the top-down approach and therefore requires preprocessing of the dataset: first, the bounding box of each person in the frame to be estimated is detected by the YOLOv5 detection algorithm; each bounding box is then enlarged by 25% and used to crop the previous and following frames as well, yielding three frame crops of the same person.
(2) And constructing a bidirectional continuous convolutional neural network model as a human body posture estimation model.
As shown in fig. 3, the bidirectional continuity convolutional neural network (DCPose) mainly consists of the following parts: a Backbone network module, a posture time merging module PTM, a posture residual fusion module PRF and a posture correction network module PCN. In this embodiment, the Backbone module uses the high-resolution network HRNet to preliminarily compute the person poses in the three input pictures, obtaining the feature vectors h_{i-1}, h_i, h_{i+1}; the three vectors are stacked into a vector φ(h) and fed into two parallel branches. The posture time merging module encodes the rough spatial range ξ(h) of each joint, and the posture residual fusion module produces the posture residual vector ψ(h); ξ(h) and ψ(h) are then stacked into the feature vector η and input into the posture correction network to obtain the final posture prediction result.
The posture time merging module consists of three stacked Residual Blocks. A group of samples passes through the Backbone network to obtain the feature vector φ(h), which is regrouped in joint order and used as the module's input; the module outputs ξ(h). Each residual block here uses grouped convolution with the parameter groups = 17 (there are 17 keypoints in total per the keypoint standard of the COCO dataset).
The posture residual fusion module consists of five stacked residual blocks. First, the posture feature differences between the second and first frames and between the second and third frames of the group of samples are computed; these are concatenated with weights to obtain the tensor ζ, which is used as the module's input, and the posture residual vector ψ(h) is output. The tensor ζ can be written as:

ζ = concat( w₁ · (h_i − h_{i−1}), w₂ · (h_i − h_{i+1}) )

where w₁ and w₂ are the concatenation weights.
as shown in fig. 2, the residual block consists of a 3*3 convolution layer, a batch normalization layer, a Relu activation layer; the gesture time merging module and the gesture residual error merging module are directly formed by cascading a plurality of residual error blocks, and the difference is that the group parameter in three residual error block convolution layers forming the PTM module is 17, the corresponding PRF module does not use group convolution, and the group parameter in the convolution layer is 1.
The posture correction network consists of five parallel deformable convolutions with dilation rates 3, 6, 9, 12 and 15 respectively. Each deformable convolution takes the stacked feature vector η (the stack of ξ(h) and ψ(h)) as input and outputs a predicted Gaussian heat map; finally the five heat maps are averaged to obtain the final prediction result.
(3) Input the data preprocessed in step (1) into the model, and update the parameters and train the model using the L2 distance as the loss function.
DCPose adopts a two-stage training strategy: first train the Backbone network, then fix the Backbone network and train the remaining networks.
DCPose takes each frame of the video in turn as the current frame to be estimated and takes one frame before and one frame after it, dividing the video into several sub-picture sequences of length 3; every sub-picture sequence carries position label information for all human keypoints. Each sub-picture sequence thus obtained is then used as an input to DCPose.
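The frame-triplet construction above can be sketched as follows; padding the sequence boundaries by repeating the edge frame is an assumption on my part, since the text does not state how the first and last frames obtain their neighbors:

```python
def split_into_triplets(frames):
    """Split a video (a list of frame identifiers) into one length-3
    sub-sequence per frame to be estimated: (previous, current, next).
    Boundary frames repeat the edge frame as their missing neighbor."""
    n = len(frames)
    triplets = []
    for i in range(n):
        prev = frames[max(i - 1, 0)]
        nxt = frames[min(i + 1, n - 1)]
        triplets.append((prev, frames[i], nxt))
    return triplets
```

Each returned triplet corresponds to one forward pass in which only the middle frame's pose is supervised.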
First, the Backbone network loads the official pre-trained model parameters; a group of sub-picture sequences is then input and posture feature vectors are output, and the mean square error against the ground-truth posture vectors gives the loss value for each frame. The loss function L is:

L = Σ_{i=1}^{N} v_i · ‖H_pred_i − H_gt_i‖₂
wherein: h gt_i H is the result of superposition of Gaussian heat maps generated by conversion of real coordinates of ith key part of all people in subsequences pred_i The result of the superposition of the Gaussian heat map generated for the coordinate transformation predicted by the ith key part of all people in the subsequence is II 2 Represents L2 norm, N is the number of key parts marked by human body, v i Indicating whether the coordinate has a label, if so, taking a value of 1, otherwise, taking a value of 0.
After the Backbone network is trained, its parameters are fixed and each sub-picture sequence is input into the DCPose network. The Backbone network produces the posture feature vector φ(h) with dimensions [4, 51, 96, 72]; the PTM network then produces the feature vector ξ(h) with dimensions [4, 17, 96, 72], and the PRF network produces the feature vector ψ(h) with dimensions [4, 128, 96, 72]. ξ(h) and ψ(h) are stacked into the vector η with dimensions [4, 145, 96, 72] and input into the PCN network; each deformable convolution layer outputs a posture feature vector with dimensions [4, 17, 96, 72], and the 5 different posture feature vectors are averaged to obtain the final Gaussian heat map.
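The channel arithmetic behind the tensor dimensions listed above can be checked directly (the zero tensors are placeholders; only the shapes matter): φ(h) stacks three 17-channel per-frame pose features into 51 channels, and η concatenates ξ(h) (17 channels) with ψ(h) (128 channels) into 145 channels.

```python
import numpy as np

# Placeholder tensors with the shapes reported in the text.
phi = np.zeros((4, 3 * 17, 96, 72))              # phi(h): 3 frames x 17 joints
xi = np.zeros((4, 17, 96, 72))                   # PTM output xi(h)
psi = np.zeros((4, 128, 96, 72))                 # PRF output psi(h)
eta = np.concatenate([xi, psi], axis=1)          # PCN input eta
```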
During DCPose training, the L2 loss is mainly adopted. In each picture sequence input into the bidirectional continuity convolutional neural network, only the 2nd frame actually needs to be estimated, so the loss value is not computed over the 1st and 3rd frames. The loss function for the 2nd frame is essentially the same as during Backbone training; the only difference is that H_gt_i is the Gaussian heat map generated from the ground-truth coordinates of the ith key part of the person in the 2nd frame of the sample, and H_pred_i is the Gaussian heat map generated from the predicted coordinates of the ith key part of the person in the 2nd frame. By fully exploiting the bidirectional information of the preceding and following frames, the network gains more accurate prediction capability.
(4) After model training is completed, a test set is input, a human body posture estimation result is output, and the specific implementation process is as follows:
4.1 inputting the test set into the trained model to obtain a Gaussian heat map of each frame.
4.2 Convert the final Gaussian heat maps from step 4.1 into the coordinate information of the human key parts using a Gaussian heat map coordinate conversion algorithm, map the coordinate information back into the original picture to obtain the positions of the key parts, and finally link the key parts in order to generate the predicted human skeleton, thereby achieving the goal of estimating the human posture.
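A common form of the heat-map-to-coordinate conversion is to take the arg-max location of each heat map; this is a sketch under that assumption, since the patent does not specify its exact conversion algorithm:

```python
import numpy as np

def heatmap_to_coord(heatmap):
    """Decode a keypoint coordinate (x, y) from a 2-D Gaussian heat map
    by locating its maximum response."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(x), int(y)
```

Production decoders typically refine this integer location (e.g., by a sub-pixel offset toward the second-highest neighbor) before mapping it back through the inverse of the ROI crop transform.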
The preceding description of the embodiments is provided to enable a person of ordinary skill in the art to make and use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles described herein may be applied to other embodiments without inventive effort. Therefore, the present invention is not limited to the embodiments described above; improvements and modifications made by those skilled in the art on the basis of this disclosure should fall within the protection scope of the present invention.

Claims (6)

1. A human body posture estimation method based on bidirectional serialization modeling comprises the following steps:
(1) Collecting and preprocessing a video data set for human body posture estimation;
(2) For a complete video in the video data set, taking continuous 3 frames of video images as a group of samples, and manually marking coordinates of each key part of a human body in the video images;
(3) Constructing a bidirectional continuous convolutional neural network, and training the convolutional neural network by using a large number of samples to obtain a human body posture estimation model;
the bidirectional continuity convolutional neural network consists of a Backbone network, a posture time merging network, a posture residual fusion network and a posture correction network, wherein the Backbone network is used for preliminarily computing the human posture feature vectors h_{i-1}, h_i, h_{i+1} in the three frames of video images of an input sample; the three feature vectors are stacked to obtain a vector φ(h), which is input into the posture time merging network and the posture residual fusion network respectively; the posture time merging network encodes the approximate spatial range of each human joint to obtain a feature vector ξ(h); the posture residual fusion network computes the human posture residual vector ψ(h); then ξ(h) and ψ(h) are stacked into the feature vector η, which is input into the posture correction network to compute the human posture prediction result;
the attitude time merging network is formed by stacking three residual blocks, and a vector phi (h) is recombined according to the joint sequence and then is used as the input of the network to output a characteristic vector xi (h); the gesture residual fusion network is formed by stacking five residual blocks, gesture feature vectors of a second frame and a first frame and gesture feature vectors of a second frame and a third frame in samples are respectively differenced, tensors zeta are obtained through cascade connection with weights and serve as input of the network, gesture residual vectors psi (h) are output, and the specific expression of the tensors zeta is as follows:
each residual block is formed by sequentially connecting a 3×3 convolution layer, a batch normalization layer and a ReLU activation layer; the residual blocks in the posture time merging network use grouped convolution with the number of groups = 17, while the residual blocks in the posture residual fusion network do not use grouped convolution, i.e. groups = 1;
the posture correction network consists of five parallel deformable convolutions with dilation rates of 3, 6, 9, 12 and 15 respectively; each deformable convolution takes the feature vector η obtained by superposing ξ(h) and ψ(h) as input and outputs a predicted Gaussian heat map, and the five Gaussian heat maps output by the five convolutions are averaged to obtain the human body posture prediction result;
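The final fusion of the five branch outputs is a plain average; a minimal numpy sketch (the 17-joint, 64×48 heat-map size is an illustrative assumption):

```python
import numpy as np

def fuse_branch_heatmaps(branch_outputs):
    """Average the Gaussian heat maps predicted by the five parallel
    deformable-convolution branches (dilation rates 3, 6, 9, 12, 15).

    branch_outputs: list of arrays, each shaped (num_joints, H, W).
    Returns the fused heat map of the same shape.
    """
    return np.mean(np.stack(branch_outputs, axis=0), axis=0)

# Constant stand-in maps so the average is easy to verify by eye.
branches = [np.full((17, 64, 48), fill_value=r) for r in (3, 6, 9, 12, 15)]
fused = fuse_branch_heatmaps(branches)
print(fused.shape, fused[0, 0, 0])  # (17, 64, 48) 9.0
```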
(4) Inputting 3 consecutive frames of video images to be estimated into the human body posture estimation model, and outputting the posture estimation result of the person in the 2nd frame of video image, namely the coordinates of each key part of the human body.
2. The human body posture estimation method according to claim 1, characterized in that: for each frame of video image in the video data set in step (1), the position coordinates of the human body ROI in the image are detected by the YOLOv5 algorithm, and the ROI is enlarged by 25%.
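The 25% ROI enlargement of claim 2 can be sketched as below; enlarging about the box center and clamping to the image bounds are assumptions, since the claim only states that the ROI is amplified by 25%:

```python
def enlarge_roi(x, y, w, h, scale=1.25, img_w=None, img_h=None):
    """Enlarge a detected ROI (top-left x, y, width w, height h) by 25%
    about its center, optionally clamping to the image bounds.

    Center-based enlargement is an assumption; claim 2 only specifies
    the 25% amplification factor.
    """
    cx, cy = x + w / 2.0, y + h / 2.0          # box center
    nw, nh = w * scale, h * scale              # enlarged size
    nx, ny = cx - nw / 2.0, cy - nh / 2.0      # new top-left corner
    if img_w is not None and img_h is not None:
        nx, ny = max(0.0, nx), max(0.0, ny)    # clamp to image bounds
        nw = min(nw, img_w - nx)
        nh = min(nh, img_h - ny)
    return nx, ny, nw, nh

print(enlarge_roi(100, 100, 80, 160))  # (90.0, 80.0, 100.0, 200.0)
```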
3. The human body posture estimation method according to claim 1, characterized in that: the training process of the bidirectional continuity convolutional neural network in step (3) is divided into two steps: first the Backbone network is trained; then the Backbone network parameters are fixed, and the posture time merging network, the posture residual fusion network and the posture correction network are trained.
4. The human body posture estimation method according to claim 3, characterized in that: the specific process of training the Backbone network is as follows: the human body ROIs in all video images of a sample are input into the Backbone network one by one, the loss function L1 between the human body posture prediction result output by the whole bidirectional continuity convolutional neural network and the manual annotation information of the sample is computed, and the Backbone network parameters are repeatedly updated through back propagation according to L1 until L1 converges; the loss function L1 can be written as L1 = Σ_{i=1}^{N} v_i·‖H_pred_i − H_gt_i‖₂,
wherein: N is the number of annotated key parts of the human body; H_gt_i is the result of superimposing the Gaussian heat maps generated from the manually annotated coordinates of the i-th key part over all human body ROIs in a group of samples; H_pred_i is the result of superimposing the Gaussian heat maps converted from the coordinates of the i-th key part predicted by the bidirectional continuity convolutional neural network over all human body ROIs in a group of samples; ‖·‖₂ denotes the L2 norm; and v_i indicates whether the i-th key part is annotated in the sample image, taking the value 1 if so and 0 otherwise.
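A numpy sketch of the visibility-masked L2 heat-map loss described above (a stand-in for whatever training framework is actually used):

```python
import numpy as np

def heatmap_l2_loss(H_pred, H_gt, v):
    """Visibility-masked L2-norm loss over N key-part heat maps.

    H_pred, H_gt: arrays shaped (N, H, W) - predicted and ground-truth
    (superimposed) Gaussian heat maps for each key part.
    v: length-N 0/1 flags, 1 if the i-th key part is annotated.
    """
    loss = 0.0
    for i in range(H_pred.shape[0]):
        # Unannotated key parts (v[i] == 0) contribute nothing.
        loss += v[i] * np.linalg.norm(H_pred[i] - H_gt[i])
    return loss

H_gt = np.zeros((2, 4, 4))
H_pred = np.zeros((2, 4, 4))
H_pred[0, 0, 0] = 3.0  # single-pixel error on an annotated key part
print(heatmap_l2_loss(H_pred, H_gt, v=[1, 1]))  # 3.0
```

The same masked-norm form covers the per-frame loss L2 of claim 5, with G_pred_i and G_gt_i in place of the superimposed maps.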
5. The human body posture estimation method according to claim 3, characterized in that: the specific process of training the posture time merging network, the posture residual fusion network and the posture correction network is as follows: first the trained parameters of the Backbone network are fixed; then the human body ROIs in all video images of a sample are input into the Backbone network one by one, the loss function L2 between the human body posture prediction result output by the whole bidirectional continuity convolutional neural network and the manual annotation information of the sample is computed, and the parameters of the posture time merging network, the posture residual fusion network and the posture correction network are repeatedly updated through back propagation according to L2 until L2 converges; the loss function L2 can be written as L2 = Σ_{i=1}^{N} v_i·‖G_pred_i − G_gt_i‖₂,
wherein: N is the number of annotated key parts of the human body; G_gt_i is the Gaussian heat map generated from the manually annotated coordinates of the i-th key part of the human body ROI in the 2nd frame video image of a group of samples; G_pred_i is the Gaussian heat map generated from the coordinates of the i-th key part of the human body ROI in the 2nd frame video image predicted by the bidirectional continuity convolutional neural network; ‖·‖₂ denotes the L2 norm; and v_i indicates whether the i-th key part is annotated in the sample image, taking the value 1 if so and 0 otherwise.
6. The human body posture estimation method according to claim 1, characterized in that: the specific implementation process of step (4) is as follows: the human body ROIs of the same person in 3 consecutive frames of video images to be estimated are input into the human body posture estimation model, which outputs Gaussian heat maps; the heat maps are converted to obtain the coordinates of the key parts of that person in the 2nd frame of video image; the coordinates are mapped back into the 2nd frame of video image, and the key parts are linked in sequence to generate the predicted human skeleton, thereby realizing human body posture estimation.
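The heat-map-to-coordinates conversion of claim 6 can be sketched with a per-map argmax; the skeleton link list below is purely illustrative, since the claim does not specify the joint order:

```python
import numpy as np

def heatmaps_to_keypoints(heatmaps):
    """Take the argmax of each Gaussian heat map as the predicted
    (x, y) coordinate of the corresponding key part."""
    coords = []
    for hm in heatmaps:
        y, x = np.unravel_index(np.argmax(hm), hm.shape)
        coords.append((int(x), int(y)))
    return coords

# Illustrative skeleton: pairs of key-part indices to link in sequence.
# The actual joint order and links are not specified in the claim.
SKELETON = [(0, 1), (1, 2)]

hm = np.zeros((3, 64, 48))
hm[0, 10, 20] = hm[1, 30, 25] = hm[2, 50, 22] = 1.0  # synthetic peaks
kps = heatmaps_to_keypoints(hm)
print(kps)  # [(20, 10), (25, 30), (22, 50)]
bones = [(kps[a], kps[b]) for a, b in SKELETON]  # skeleton line segments
```

In practice the argmax coordinates would also be rescaled from heat-map resolution back to the ROI and then to the full 2nd-frame image.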
CN202011610311.1A 2020-12-30 2020-12-30 Human body posture estimation method based on bidirectional serialization modeling Active CN112633220B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011610311.1A CN112633220B (en) 2020-12-30 2020-12-30 Human body posture estimation method based on bidirectional serialization modeling

Publications (2)

Publication Number Publication Date
CN112633220A CN112633220A (en) 2021-04-09
CN112633220B true CN112633220B (en) 2024-01-09

Family

ID=75286799

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011610311.1A Active CN112633220B (en) 2020-12-30 2020-12-30 Human body posture estimation method based on bidirectional serialization modeling

Country Status (1)

Country Link
CN (1) CN112633220B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113205043B (en) * 2021-04-30 2022-06-07 武汉大学 Video sequence two-dimensional attitude estimation method based on reinforcement learning
CN113627396B (en) * 2021-09-22 2023-09-05 浙江大学 Rope skipping counting method based on health monitoring
CN113920545A (en) * 2021-12-13 2022-01-11 中煤科工开采研究院有限公司 Method and device for detecting posture of underground coal mine personnel
CN115116132B (en) * 2022-06-13 2023-07-28 南京邮电大学 Human behavior analysis method for depth perception in Internet of things edge service environment
CN116386089B (en) * 2023-06-05 2023-10-31 季华实验室 Human body posture estimation method, device, equipment and storage medium under motion scene

Citations (4)

Publication number Priority date Publication date Assignee Title
WO2016190934A2 (en) * 2015-02-27 2016-12-01 Massachusetts Institute Of Technology Methods, systems, and apparatus for global multiple-access optical communications
WO2017133009A1 (en) * 2016-02-04 2017-08-10 广州新节奏智能科技有限公司 Method for positioning human joint using depth image of convolutional neural network
CN108932500A (en) * 2018-07-09 2018-12-04 广州智能装备研究院有限公司 A kind of dynamic gesture identification method and system based on deep neural network
CN111695457A (en) * 2020-05-28 2020-09-22 浙江工商大学 Human body posture estimation method based on weak supervision mechanism


Non-Patent Citations (1)

Title
Human pose estimation with a lightweight dual-path convolutional neural network and inter-frame information reasoning; Chen Yukun; Wang Zhengxiang; Yu Lianzhi; 小型微型计算机*** (10); full text *


Similar Documents

Publication Publication Date Title
CN112633220B (en) Human body posture estimation method based on bidirectional serialization modeling
WO2023056889A1 (en) Model training and scene recognition method and apparatus, device, and medium
US11176381B2 (en) Video object segmentation by reference-guided mask propagation
Zheng et al. Cross-domain object detection through coarse-to-fine feature adaptation
Bao et al. Monofenet: Monocular 3d object detection with feature enhancement networks
Yin et al. FD-SSD: An improved SSD object detection algorithm based on feature fusion and dilated convolution
Fu et al. Let there be light: Improved traffic surveillance via detail preserving night-to-day transfer
CN112446342A (en) Key frame recognition model training method, recognition method and device
Cho et al. Semantic segmentation with low light images by modified CycleGAN-based image enhancement
CN112084952B (en) Video point location tracking method based on self-supervision training
CN113191204B (en) Multi-scale blocking pedestrian detection method and system
Wang et al. MCF3D: Multi-stage complementary fusion for multi-sensor 3D object detection
CN112036260A (en) Expression recognition method and system for multi-scale sub-block aggregation in natural environment
CN115376024A (en) Semantic segmentation method for power accessory of power transmission line
CN111626134A (en) Dense crowd counting method, system and terminal based on hidden density distribution
CN116092190A (en) Human body posture estimation method based on self-attention high-resolution network
CN114913342A (en) Motion blurred image line segment detection method and system fusing event and image
Xi et al. Implicit motion-compensated network for unsupervised video object segmentation
CN113269038A (en) Multi-scale-based pedestrian detection method
Zhang et al. Key point localization and recurrent neural network based water meter reading recognition
CN111738092A (en) Method for recovering shielded human body posture sequence based on deep learning
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
Lee et al. Alternative Collaborative Learning for Character Recognition in Low-Resolution Images
CN115953744A (en) Vehicle identification tracking method based on deep learning
Kim et al. Global convolutional neural networks with self-attention for fisheye image rectification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant