CN112633220B - Human body posture estimation method based on bidirectional serialization modeling - Google Patents

Publication number: CN112633220B (granted); earlier published as CN112633220A
Application number: CN202011610311.1A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 刘振广, 封润洋, 陈豪明, 王勋, 钱鹏
Assignee (original and current): Zhejiang Gongshang University
Legal status: Active

Classifications

    • G06V20/40 — Scenes; scene-specific elements in video content
    • G06F18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N3/045 — Neural networks; combinations of networks
    • G06N3/08 — Neural networks; learning methods
    • G06V10/25 — Image preprocessing; determination of region of interest [ROI] or volume of interest [VOI]
    • G06V40/10 — Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • Y02T10/40 — Engine management systems


Abstract

The invention discloses a human body posture estimation method based on bidirectional serialization modeling. The method takes 3 consecutive frames as input and makes full use of the temporal information in the video: it first computes the approximate spatial range of each joint and then regresses the precise joint position within that smaller range. This better handles the occlusion and motion blur inherent in the human posture estimation task, giving the model stronger generalization and higher accuracy. By fully exploiting the temporal information of video, the invention strengthens the reasoning ability of the model and estimates key human body parts more reliably, which is of practical importance for industries such as security and short-video platforms that need to extract poses in real time for analysis.

Description

Human body posture estimation method based on bidirectional serialization modeling
Technical Field
The invention belongs to the technical field of human body posture estimation, and particularly relates to a human body posture estimation method based on bidirectional serialization modeling.
Background
Human body posture estimation is a frontier research field in computer vision that aims to locate key parts of the human body (such as wrists and ankles) in a picture or video. Posture estimation serves as a bridge for communication between machines and humans and has great practical significance; it has been widely applied in many fields. In stage animation, recognizing human poses and actions enables real-time interactive animation effects; in autonomous driving, predicting the movement trend of pedestrians can help avoid traffic accidents in advance; in security, abnormal behaviors can be detected by recognizing specific pose sequences.
Current human body posture estimation methods fall into two main categories. (1) Top-down: first detect the positions of all human bodies in the picture and mark each person with a rectangular bounding box; then identify each person's joints with a joint position detector; finally map the pose information of the cropped person back into the original picture by affine transformation, thereby estimating the poses of all people in the picture. Top-down methods separate the person detection task from the joint detection task and can concentrate on pose estimation, so they achieve higher accuracy; however, their detection time grows with the number of people in the picture, and because they rely on object detection technology, the quality of the detected position coordinates directly affects the final pose estimation result. (2) Bottom-up: first detect the joint position information of all human bodies in the picture, then cluster the joint coordinates belonging to the same person, thereby estimating the poses of all people in the picture. Bottom-up methods are efficient and their detection time is less affected by the number of people, but their accuracy lags slightly behind.
Mainstream posture estimation methods, whether top-down or bottom-up, use network architectures designed for static pictures and excel at estimating poses in single frames; when applied to video, they usually decompose the video into single frames and estimate each frame separately. These methods have a serious limitation: they can only capture the appearance information of a single picture. One frame, i.e., 1/25 second, is very short, so the images of two neighboring video frames cannot change greatly and are highly similar; this rich geometric consistency between adjacent frames provides extra cues that can be used to correct keypoints that are hard to predict, such as those affected by occlusion or motion blur.
Conventional picture-based pose estimation methods cannot effectively exploit this additional information, and therefore cannot handle the heavy person entanglement, mutual occlusion and motion blur that frequently occur in video sequences, so they struggle to obtain good results in video pose estimation. To address this problem, the document [Flowing ConvNets for Human Pose Estimation in Videos - Pfister, T., Charles, J. & Zisserman, A. (ICCV 2015)] proposes computing dense optical flow between every two frames and then using the flow-based temporal information to correct the initial pose estimates. When the optical flow can be computed correctly, this method works well; however, optical flow computation is strongly affected by picture quality, occlusion and the like, not all optical flow information in a video can be computed accurately, and computing it often requires a great deal of compute. Other researchers have proposed modeling the video directly with Long Short-Term Memory (LSTM) networks to capture temporal information; however, owing to the architectural limitations of LSTM networks, this approach only works well when the people in the video frames are sparse, and in complex scenes it still cannot handle occlusion, motion blur and similar situations.
Disclosure of Invention
In view of the above, the invention provides a human body posture estimation method based on bidirectional serialization modeling. The method takes 3 consecutive frames as input, makes full use of the temporal information in the video to compute the approximate spatial range of each joint, and then regresses the precise joint position within that smaller range, thereby better handling the occlusion and motion blur inherent in the human posture estimation task and giving the model stronger generalization and higher accuracy.
A human body posture estimation method based on bidirectional serialization modeling comprises the following steps:
(1) Collecting and preprocessing a video data set for human body posture estimation;
(2) For a complete video in the video data set, taking continuous 3 frames of video images as a group of samples, and manually marking coordinates of each key part of a human body in the video images;
(3) Constructing a bidirectional continuous convolutional neural network, and training the convolutional neural network by using a large number of samples to obtain a human body posture estimation model;
(4) Input the 3 consecutive frames of video images to be estimated into the human body posture estimation model, and output the posture estimation result for the person in the 2nd frame video image, i.e., the coordinates of each key part of the human body.
Further, in step (1), for each frame of video image in the video dataset, the position coordinates of the human ROI (region of interest, i.e., the person's bounding box) in the image are detected by the YOLOv5 algorithm, and the ROI is then enlarged by 25%.
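As a minimal sketch of the ROI enlargement step, the box can be expanded by 25% about its center; the function name and the center-anchored convention are illustrative assumptions, since the text only states that the ROI is enlarged by 25%:

```python
def enlarge_roi(x, y, w, h, scale=1.25):
    """Enlarge a person bounding box about its center by the given factor.

    (x, y) is the top-left corner; w and h are width and height.
    scale=1.25 corresponds to the 25% enlargement described in the text.
    """
    cx, cy = x + w / 2.0, y + h / 2.0      # box center
    new_w, new_h = w * scale, h * scale    # enlarged extent
    return cx - new_w / 2.0, cy - new_h / 2.0, new_w, new_h
```

In practice the enlarged box would also be clipped to the image bounds before cropping the three frames.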
Further, the bidirectional continuity convolutional neural network consists of a Backbone network, a posture time merging network, a posture residual fusion network and a posture correction network. The Backbone network preliminarily computes the human posture feature vectors h_{i-1}, h_i, h_{i+1} for the three frames of video images of an input sample; the three feature vectors are stacked to obtain a vector φ(h), which is input into the posture time merging network and the posture residual fusion network respectively. The posture time merging network encodes the approximate spatial range of each human joint to obtain a feature vector ξ(h); the posture residual fusion network computes the human posture residual vector ψ(h); then ξ(h) and ψ(h) are stacked into the feature vector η, which is input into the posture correction network to compute the human posture prediction result.
Further, the posture time merging network is formed by stacking three Residual Blocks; the vector φ(h) is regrouped in joint order and then used as the input of this network, which outputs the feature vector ξ(h). The posture residual fusion network is formed by stacking five residual blocks; the posture feature differences between the second and first frames and between the second and third frames of a sample are computed and concatenated with weights to obtain the tensor ζ, which is used as the input of this network, and the posture residual vector ψ(h) is output. The tensor ζ is given by:

ζ = concat( w₁ · (h_i − h_{i−1}), w₂ · (h_i − h_{i+1}) )

where w₁ and w₂ are the concatenation weights.
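The weighted difference-and-concatenate step that forms ζ can be sketched in a few lines of numpy; the function name, the equal default weights, and concatenation along the channel axis are illustrative assumptions, since the text specifies only "differences cascaded with weights":

```python
import numpy as np

def pose_residual_input(h_prev, h_cur, h_next, w1=0.5, w2=0.5):
    """Build the tensor zeta fed to the posture residual fusion network:
    the weighted concatenation of (frame2 - frame1) and (frame2 - frame3)
    posture feature differences."""
    d1 = h_cur - h_prev   # second-frame minus first-frame features
    d2 = h_cur - h_next   # second-frame minus third-frame features
    return np.concatenate([w1 * d1, w2 * d2], axis=0)
```

With 17-channel per-frame posture features, ζ has twice the channel count of a single frame's features.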
further, the residual block is formed by sequentially connecting a convolution layer with the size of 3×3, a batch normalization layer and a Relu activation layer, the residual block in the gesture time merging network adopts packet convolution, and the number of packets=17 (according to the key point standard of the COCO data set, 17 key points are all adopted); the residual blocks in the pose residual fusion network do not use packet convolution, the number of packets=1.
Further, the posture correction network consists of five parallel deformable convolutions with dilation rates 3, 6, 9, 12 and 15 respectively. Each deformable convolution takes the feature vector η, obtained by stacking ξ(h) and ψ(h), as input and outputs a predicted Gaussian heat map; the five Gaussian heat maps output by the five convolutions are averaged to obtain the human posture prediction result.
Further, the training process of the bidirectional continuity convolutional neural network in step (3) is divided into two stages: first train the Backbone network; then fix the Backbone network parameters and train the posture time merging network, posture residual fusion network and posture correction network.
Further, the specific process of training the Backbone network is as follows: input the human ROIs in all video images of a sample into the Backbone network one by one, compute the loss function L1 between the human posture prediction output by the whole bidirectional continuity convolutional neural network and the manual annotation corresponding to the sample, and repeatedly update the Backbone network parameters by back-propagation according to L1 until L1 converges. The loss function L1 is:

L1 = Σ_{i=1}^{N} v_i · ‖H_pred_i − H_gt_i‖₂
wherein: n is the number of marked key parts of the human body, H gt_i Converting coordinates of the ith key part artificial mark of all human ROIs in a group of samples to generate a Gaussian heat map superimposed result, H pred_i Converting coordinates of ith key part of all human ROIs in a group of samples, which are predicted and output by a bidirectional continuous convolutional neural network, to generate a result obtained by overlapping Gaussian heat maps,‖ ‖ 2 Represents the L2 norm, v i Indicating whether the ith key part has a label in the sample image, if so, the value is 1, otherwise, the value is 0.
Further, the specific process of training the posture time merging network, the posture residual fusion network and the posture correction network is as follows: first fix the trained Backbone network parameters; then input the human ROIs in all video images of a sample into the Backbone network one by one, compute the loss function L2 between the human posture prediction output by the whole bidirectional continuity convolutional neural network and the manual annotation corresponding to the sample, and repeatedly update the parameters of the posture time merging network, posture residual fusion network and posture correction network by back-propagation according to L2 until L2 converges. The loss function L2 is:

L2 = Σ_{i=1}^{N} v_i · ‖G_pred_i − G_gt_i‖₂
wherein: n is the number of marked key parts of the human body, G gt_i A Gaussian heat map, G, generated by converting coordinates of an ith key part artificial mark of a human body ROI in a 2 nd frame video image of a group of samples pred_i The method comprises the steps that a Gaussian heat map generated by converting the output coordinates of the ith key part of a human body ROI in a 2 nd frame video image of a group of samples through bidirectional continuous convolutional neural network prediction is II 2 Represents the L2 norm, v i Indicating whether the ith key part has a label in the sample image, if so, the value is 1, otherwise, the value is 0.
Further, the specific implementation of step (4) is as follows: input the human ROIs of the same person in the 3 consecutive frames to be estimated into the human posture estimation model and output a Gaussian heat map; convert the Gaussian heat map to obtain the key-part coordinate information of that person in the 2nd frame video image; map the coordinate information into the 2nd frame video image and link the key parts in order to generate the predicted human skeleton, thereby achieving human posture estimation.
The human body posture estimation method based on bidirectional continuity mainly uses deformable convolution networks with different dilation rates as the prediction model. A deformable convolution network is a variant of the conventional convolutional neural network: conventional convolution kernels are square, while common objects such as human bodies are not, so conventional convolution networks have certain limitations; a deformable convolution network learns an offset parameter for each pixel of the kernel and can thus form kernels of arbitrary shape that better fit objects of various shapes. Each convolution layer adopts a different dilation rate, corresponding to a different receptive field: the larger the dilation rate, the larger the receptive field and the more global the captured information; conversely, a smaller dilation rate captures finer local information. This design of the deformable convolution network is therefore well suited to human body pose estimation in video.
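The relationship between dilation rate and receptive field can be made concrete with the standard formula for the effective extent of a dilated kernel, d·(k−1)+1, applied to the five dilation rates used above (the function name is illustrative):

```python
def effective_kernel(k=3, dilation=1):
    """Effective spatial extent of a k x k convolution kernel with the
    given dilation rate: dilation * (k - 1) + 1 pixels per side."""
    return dilation * (k - 1) + 1

# Effective extents for the five parallel 3x3 deformable convolutions
# with dilation rates 3, 6, 9, 12, 15 described in the text.
spans = {d: effective_kernel(3, d) for d in (3, 6, 9, 12, 15)}
```

The spread from a 7-pixel to a 31-pixel extent is what lets the five branches capture local detail and global context in parallel.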
The invention makes full use of the temporal information in video, strengthens the reasoning ability of the model, and estimates key human body parts more reliably; it is of practical importance for industries such as security and short-video platforms that need to extract poses in real time for analysis. Its two main advantages are:
1. Through an accurate pose estimation algorithm, the invention can better estimate occluded and motion-blurred keypoints, and detection is both more accurate and faster.
2. Being designed for video, the method better fits a variety of application scenarios; meanwhile, by adopting grouped convolution, dilated convolution and the like, it achieves better results with fewer parameters, so pose estimation can be applied in real time.
Drawings
Fig. 1 is a flow chart of a human body posture estimation method of the present invention.
FIG. 2 is a schematic diagram of the Residual Block structure and its stacking scheme.
FIG. 3 is a schematic diagram of a two-way continuous convolutional neural network of the present invention.
Detailed Description
In order to describe the present invention more specifically, the technical scheme of the invention is described in detail below with reference to the accompanying drawings and specific embodiments.
As shown in fig. 1, the human body posture estimation method based on bidirectional continuity of the invention comprises the following steps:
(1) And collecting and selecting a human body posture estimation video data set, and preprocessing the data set.
In this embodiment, the training data uses the PoseTrack dataset, which is designed for the human pose tracking task; many of its videos contain person occlusion and motion blur, which greatly increases the difficulty of video human pose estimation. This embodiment follows the top-down approach and therefore requires preprocessing of the dataset: first, the bounding box of each person in the frame to be estimated is detected by the YOLOv5 detection algorithm; each bounding box is then enlarged by 25% and used to crop the previous and following frames as well, yielding three frame crops of the same person.
(2) And constructing a bidirectional continuous convolutional neural network model as a human body posture estimation model.
As shown in fig. 3, the bidirectional continuity convolutional neural network (DCPose) mainly consists of the following parts: a Backbone network module, a posture time merging module PTM, a posture residual fusion module PRF and a posture correction network module PCN. In this embodiment, the Backbone module uses the high-resolution network HRNet to preliminarily compute the person poses in the three input pictures, obtaining the feature vectors h_{i-1}, h_i, h_{i+1}; the three vectors are stacked into a vector φ(h) and fed into two parallel branches. The posture time merging module encodes the rough spatial range ξ(h) of each joint, and the posture residual fusion module produces the posture residual vector ψ(h); ξ(h) and ψ(h) are then stacked into the feature vector η and input into the posture correction network to obtain the final posture prediction result.
The posture time merging module consists of three stacked Residual Blocks. A group of samples passes through the Backbone network to obtain the feature vector φ(h), which is regrouped in joint order and used as the module's input; the module outputs ξ(h). Each residual block here uses grouped convolution with the parameter groups = 17 (there are 17 keypoints in total per the keypoint standard of the COCO dataset).
The posture residual fusion module consists of five stacked residual blocks. First, the posture feature differences between the second and first frames and between the second and third frames of the group of samples are computed; these are concatenated with weights to obtain the tensor ζ, which is used as the module's input, and the posture residual vector ψ(h) is output. The tensor ζ can be written as:

ζ = concat( w₁ · (h_i − h_{i−1}), w₂ · (h_i − h_{i+1}) )

where w₁ and w₂ are the concatenation weights.
as shown in fig. 2, the residual block consists of a 3*3 convolution layer, a batch normalization layer, a Relu activation layer; the gesture time merging module and the gesture residual error merging module are directly formed by cascading a plurality of residual error blocks, and the difference is that the group parameter in three residual error block convolution layers forming the PTM module is 17, the corresponding PRF module does not use group convolution, and the group parameter in the convolution layer is 1.
The posture correction network consists of five parallel deformable convolutions with dilation rates 3, 6, 9, 12 and 15 respectively. Each deformable convolution takes the stacked feature vector η (the stack of ξ(h) and ψ(h)) as input and outputs a predicted Gaussian heat map; finally the five heat maps are averaged to obtain the final prediction result.
(3) Input the data preprocessed in step (1) into the model, and update the parameters and train the model using the L2 distance as the loss function.
DCPose adopts a two-stage training strategy: first train the Backbone network, then fix the Backbone network and train the remaining networks.
DCPose takes each frame of the video in turn as the current frame to be estimated and takes one frame before and one frame after it, dividing the video into several sub-picture sequences of length 3; every sub-picture sequence carries position label information for all human keypoints. Each sub-picture sequence thus obtained is then used as an input to DCPose.
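The frame-triplet construction above can be sketched as follows; padding the sequence boundaries by repeating the edge frame is an assumption on my part, since the text does not state how the first and last frames obtain their neighbors:

```python
def split_into_triplets(frames):
    """Split a video (a list of frame identifiers) into one length-3
    sub-sequence per frame to be estimated: (previous, current, next).
    Boundary frames repeat the edge frame as their missing neighbor."""
    n = len(frames)
    triplets = []
    for i in range(n):
        prev = frames[max(i - 1, 0)]
        nxt = frames[min(i + 1, n - 1)]
        triplets.append((prev, frames[i], nxt))
    return triplets
```

Each returned triplet corresponds to one forward pass in which only the middle frame's pose is supervised.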
First, the Backbone network loads the official pre-trained model parameters; a group of sub-picture sequences is then input and posture feature vectors are output, and the mean square error against the ground-truth posture vectors gives the loss value for each frame. The loss function L is:

L = Σ_{i=1}^{N} v_i · ‖H_pred_i − H_gt_i‖₂
wherein: h gt_i H is the result of superposition of Gaussian heat maps generated by conversion of real coordinates of ith key part of all people in subsequences pred_i The result of the superposition of the Gaussian heat map generated for the coordinate transformation predicted by the ith key part of all people in the subsequence is II 2 Represents L2 norm, N is the number of key parts marked by human body, v i Indicating whether the coordinate has a label, if so, taking a value of 1, otherwise, taking a value of 0.
After the Backbone network is trained, its parameters are fixed and each sub-picture sequence is input into the DCPose network. The Backbone network produces the posture feature vector φ(h) with dimensions [4, 51, 96, 72]; the PTM network then produces the feature vector ξ(h) with dimensions [4, 17, 96, 72], and the PRF network produces the feature vector ψ(h) with dimensions [4, 128, 96, 72]. ξ(h) and ψ(h) are stacked into the vector η with dimensions [4, 145, 96, 72] and input into the PCN network; each deformable convolution layer outputs a posture feature vector with dimensions [4, 17, 96, 72], and the 5 different posture feature vectors are averaged to obtain the final Gaussian heat map.
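The channel arithmetic behind the tensor dimensions listed above can be checked directly (the zero tensors are placeholders; only the shapes matter): φ(h) stacks three 17-channel per-frame pose features into 51 channels, and η concatenates ξ(h) (17 channels) with ψ(h) (128 channels) into 145 channels.

```python
import numpy as np

# Placeholder tensors with the shapes reported in the text.
phi = np.zeros((4, 3 * 17, 96, 72))              # phi(h): 3 frames x 17 joints
xi = np.zeros((4, 17, 96, 72))                   # PTM output xi(h)
psi = np.zeros((4, 128, 96, 72))                 # PRF output psi(h)
eta = np.concatenate([xi, psi], axis=1)          # PCN input eta
```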
During DCPose training, the L2 loss is mainly adopted. In each picture sequence input into the bidirectional continuity convolutional neural network, only the 2nd frame actually needs to be estimated, so the loss value is not computed over the 1st and 3rd frames. The loss function for the 2nd frame is essentially the same as during Backbone training; the only difference is that H_gt_i is the Gaussian heat map generated from the ground-truth coordinates of the ith key part of the person in the 2nd frame of the sample, and H_pred_i is the Gaussian heat map generated from the predicted coordinates of the ith key part of the person in the 2nd frame. By fully exploiting the bidirectional information of the preceding and following frames, the network gains more accurate prediction capability.
(4) After model training is completed, a test set is input, a human body posture estimation result is output, and the specific implementation process is as follows:
4.1 inputting the test set into the trained model to obtain a Gaussian heat map of each frame.
4.2 Convert the final Gaussian heat maps from step 4.1 into the coordinate information of the human key parts using a Gaussian heat map coordinate conversion algorithm, map the coordinate information back into the original picture to obtain the positions of the key parts, and finally link the key parts in order to generate the predicted human skeleton, thereby achieving the goal of estimating the human posture.
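A common form of the heat-map-to-coordinate conversion is to take the arg-max location of each heat map; this is a sketch under that assumption, since the patent does not specify its exact conversion algorithm:

```python
import numpy as np

def heatmap_to_coord(heatmap):
    """Decode a keypoint coordinate (x, y) from a 2-D Gaussian heat map
    by locating its maximum response."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(x), int(y)
```

Production decoders typically refine this integer location (e.g., by a sub-pixel offset toward the second-highest neighbor) before mapping it back through the inverse of the ROI crop transform.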
The preceding description of the embodiments is provided to enable a person of ordinary skill in the art to make and use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles described herein may be applied to other embodiments without inventive effort. Therefore, the present invention is not limited to the embodiments described above; improvements and modifications made by those skilled in the art on the basis of this disclosure should fall within the protection scope of the present invention.

Claims (6)

1. A human body posture estimation method based on bidirectional serialization modeling comprises the following steps:
(1) Collecting and preprocessing a video data set for human body posture estimation;
(2) For a complete video in the video data set, taking continuous 3 frames of video images as a group of samples, and manually marking coordinates of each key part of a human body in the video images;
(3) Constructing a bidirectional continuous convolutional neural network, and training the convolutional neural network by using a large number of samples to obtain a human body posture estimation model;
the bidirectional continuity convolutional neural network consists of a Backbone network, a posture time merging network, a posture residual fusion network and a posture correction network, wherein the Backbone network is used for preliminarily computing the human posture feature vectors h_{i-1}, h_i, h_{i+1} in the three frames of video images of an input sample; the three feature vectors are stacked to obtain a vector φ(h), which is input into the posture time merging network and the posture residual fusion network respectively; the posture time merging network encodes the approximate spatial range of each human joint to obtain a feature vector ξ(h); the posture residual fusion network computes the human posture residual vector ψ(h); then ξ(h) and ψ(h) are stacked into the feature vector η, which is input into the posture correction network to compute the human posture prediction result;
the attitude time merging network is formed by stacking three residual blocks, and a vector phi (h) is recombined according to the joint sequence and then is used as the input of the network to output a characteristic vector xi (h); the gesture residual fusion network is formed by stacking five residual blocks, gesture feature vectors of a second frame and a first frame and gesture feature vectors of a second frame and a third frame in samples are respectively differenced, tensors zeta are obtained through cascade connection with weights and serve as input of the network, gesture residual vectors psi (h) are output, and the specific expression of the tensors zeta is as follows:
each residual block is formed by sequentially connecting a 3×3 convolution layer, a batch normalization layer and a ReLU activation layer; the residual blocks in the posture time merging network use grouped convolution with the number of groups = 17, while the residual blocks in the posture residual fusion network do not use grouped convolution, i.e. groups = 1;
the posture correction network consists of five parallel deformable convolutions with dilation rates of 3, 6, 9, 12 and 15 respectively; each deformable convolution takes the feature vector η obtained by superposing ξ(h) and ψ(h) as input and outputs a predicted Gaussian heat map, and the five Gaussian heat maps output by the five convolutions are averaged to obtain the human body posture prediction result;
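The final fusion of the five branch outputs is a plain average; a minimal numpy sketch (the 17-joint, 64×48 heat-map size is an illustrative assumption):

```python
import numpy as np

def fuse_branch_heatmaps(branch_outputs):
    """Average the Gaussian heat maps predicted by the five parallel
    deformable-convolution branches (dilation rates 3, 6, 9, 12, 15).

    branch_outputs: list of arrays, each shaped (num_joints, H, W).
    Returns the fused heat map of the same shape.
    """
    return np.mean(np.stack(branch_outputs, axis=0), axis=0)

# Constant stand-in maps so the average is easy to verify by eye.
branches = [np.full((17, 64, 48), fill_value=r) for r in (3, 6, 9, 12, 15)]
fused = fuse_branch_heatmaps(branches)
print(fused.shape, fused[0, 0, 0])  # (17, 64, 48) 9.0
```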
(4) Inputting 3 consecutive frames of video images to be estimated into the human body posture estimation model, and outputting the posture estimation result of the person in the 2nd frame of video image, namely the coordinates of each key part of the human body.
2. The human body posture estimation method according to claim 1, characterized in that: for each frame of video image in the video data set in step (1), the position coordinates of the human body ROI in the image are detected by the YOLOv5 algorithm, and the ROI is enlarged by 25%.
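The 25% ROI enlargement of claim 2 can be sketched as below; enlarging about the box center and clamping to the image bounds are assumptions, since the claim only states that the ROI is amplified by 25%:

```python
def enlarge_roi(x, y, w, h, scale=1.25, img_w=None, img_h=None):
    """Enlarge a detected ROI (top-left x, y, width w, height h) by 25%
    about its center, optionally clamping to the image bounds.

    Center-based enlargement is an assumption; claim 2 only specifies
    the 25% amplification factor.
    """
    cx, cy = x + w / 2.0, y + h / 2.0          # box center
    nw, nh = w * scale, h * scale              # enlarged size
    nx, ny = cx - nw / 2.0, cy - nh / 2.0      # new top-left corner
    if img_w is not None and img_h is not None:
        nx, ny = max(0.0, nx), max(0.0, ny)    # clamp to image bounds
        nw = min(nw, img_w - nx)
        nh = min(nh, img_h - ny)
    return nx, ny, nw, nh

print(enlarge_roi(100, 100, 80, 160))  # (90.0, 80.0, 100.0, 200.0)
```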
3. The human body posture estimation method according to claim 1, characterized in that: the training process of the bidirectional continuity convolutional neural network in step (3) is divided into two steps: first the Backbone network is trained; then the Backbone network parameters are fixed, and the posture time merging network, the posture residual fusion network and the posture correction network are trained.
4. The human body posture estimation method according to claim 3, characterized in that: the specific process of training the Backbone network is as follows: the human body ROIs in all video images of a sample are input into the Backbone network one by one, the loss function L1 between the human body posture prediction result output by the whole bidirectional continuity convolutional neural network and the manual annotation information of the sample is computed, and the Backbone network parameters are repeatedly updated through back propagation according to L1 until L1 converges; the loss function L1 can be written as L1 = Σ_{i=1}^{N} v_i·‖H_pred_i − H_gt_i‖₂,
wherein: N is the number of annotated key parts of the human body; H_gt_i is the result of superimposing the Gaussian heat maps generated from the manually annotated coordinates of the i-th key part over all human body ROIs in a group of samples; H_pred_i is the result of superimposing the Gaussian heat maps converted from the coordinates of the i-th key part predicted by the bidirectional continuity convolutional neural network over all human body ROIs in a group of samples; ‖·‖₂ denotes the L2 norm; and v_i indicates whether the i-th key part is annotated in the sample image, taking the value 1 if so and 0 otherwise.
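A numpy sketch of the visibility-masked L2 heat-map loss described above (a stand-in for whatever training framework is actually used):

```python
import numpy as np

def heatmap_l2_loss(H_pred, H_gt, v):
    """Visibility-masked L2-norm loss over N key-part heat maps.

    H_pred, H_gt: arrays shaped (N, H, W) - predicted and ground-truth
    (superimposed) Gaussian heat maps for each key part.
    v: length-N 0/1 flags, 1 if the i-th key part is annotated.
    """
    loss = 0.0
    for i in range(H_pred.shape[0]):
        # Unannotated key parts (v[i] == 0) contribute nothing.
        loss += v[i] * np.linalg.norm(H_pred[i] - H_gt[i])
    return loss

H_gt = np.zeros((2, 4, 4))
H_pred = np.zeros((2, 4, 4))
H_pred[0, 0, 0] = 3.0  # single-pixel error on an annotated key part
print(heatmap_l2_loss(H_pred, H_gt, v=[1, 1]))  # 3.0
```

The same masked-norm form covers the per-frame loss L2 of claim 5, with G_pred_i and G_gt_i in place of the superimposed maps.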
5. The human body posture estimation method according to claim 3, characterized in that: the specific process of training the posture time merging network, the posture residual fusion network and the posture correction network is as follows: first the trained parameters of the Backbone network are fixed; then the human body ROIs in all video images of a sample are input into the Backbone network one by one, the loss function L2 between the human body posture prediction result output by the whole bidirectional continuity convolutional neural network and the manual annotation information of the sample is computed, and the parameters of the posture time merging network, the posture residual fusion network and the posture correction network are repeatedly updated through back propagation according to L2 until L2 converges; the loss function L2 can be written as L2 = Σ_{i=1}^{N} v_i·‖G_pred_i − G_gt_i‖₂,
wherein: N is the number of annotated key parts of the human body; G_gt_i is the Gaussian heat map generated from the manually annotated coordinates of the i-th key part of the human body ROI in the 2nd frame video image of a group of samples; G_pred_i is the Gaussian heat map generated from the coordinates of the i-th key part of the human body ROI in the 2nd frame video image predicted by the bidirectional continuity convolutional neural network; ‖·‖₂ denotes the L2 norm; and v_i indicates whether the i-th key part is annotated in the sample image, taking the value 1 if so and 0 otherwise.
6. The human body posture estimation method according to claim 1, characterized in that: the specific implementation process of step (4) is as follows: the human body ROIs of the same person in 3 consecutive frames of video images to be estimated are input into the human body posture estimation model, which outputs Gaussian heat maps; the heat maps are converted to obtain the coordinates of the key parts of that person in the 2nd frame of video image; the coordinates are mapped back into the 2nd frame of video image, and the key parts are linked in sequence to generate the predicted human skeleton, thereby realizing human body posture estimation.
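The heat-map-to-coordinates conversion of claim 6 can be sketched with a per-map argmax; the skeleton link list below is purely illustrative, since the claim does not specify the joint order:

```python
import numpy as np

def heatmaps_to_keypoints(heatmaps):
    """Take the argmax of each Gaussian heat map as the predicted
    (x, y) coordinate of the corresponding key part."""
    coords = []
    for hm in heatmaps:
        y, x = np.unravel_index(np.argmax(hm), hm.shape)
        coords.append((int(x), int(y)))
    return coords

# Illustrative skeleton: pairs of key-part indices to link in sequence.
# The actual joint order and links are not specified in the claim.
SKELETON = [(0, 1), (1, 2)]

hm = np.zeros((3, 64, 48))
hm[0, 10, 20] = hm[1, 30, 25] = hm[2, 50, 22] = 1.0  # synthetic peaks
kps = heatmaps_to_keypoints(hm)
print(kps)  # [(20, 10), (25, 30), (22, 50)]
bones = [(kps[a], kps[b]) for a, b in SKELETON]  # skeleton line segments
```

In practice the argmax coordinates would also be rescaled from heat-map resolution back to the ROI and then to the full 2nd-frame image.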
CN202011610311.1A 2020-12-30 2020-12-30 Human body posture estimation method based on bidirectional serialization modeling Active CN112633220B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011610311.1A CN112633220B (en) 2020-12-30 2020-12-30 Human body posture estimation method based on bidirectional serialization modeling

Publications (2)

Publication Number Publication Date
CN112633220A CN112633220A (en) 2021-04-09
CN112633220B true CN112633220B (en) 2024-01-09

Family

ID=75286799

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011610311.1A Active CN112633220B (en) 2020-12-30 2020-12-30 Human body posture estimation method based on bidirectional serialization modeling

Country Status (1)

Country Link
CN (1) CN112633220B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113205043B (en) * 2021-04-30 2022-06-07 武汉大学 Video sequence two-dimensional attitude estimation method based on reinforcement learning
CN113627396B (en) * 2021-09-22 2023-09-05 浙江大学 Rope skipping counting method based on health monitoring
CN113920545A (en) * 2021-12-13 2022-01-11 中煤科工开采研究院有限公司 Method and device for detecting posture of underground coal mine personnel
CN115116132B (en) * 2022-06-13 2023-07-28 南京邮电大学 Human behavior analysis method for depth perception in Internet of things edge service environment
CN116386089B (en) * 2023-06-05 2023-10-31 季华实验室 Human body posture estimation method, device, equipment and storage medium under motion scene

Citations (4)

Publication number Priority date Publication date Assignee Title
WO2016190934A2 (en) * 2015-02-27 2016-12-01 Massachusetts Institute Of Technology Methods, systems, and apparatus for global multiple-access optical communications
WO2017133009A1 (en) * 2016-02-04 2017-08-10 广州新节奏智能科技有限公司 Method for positioning human joint using depth image of convolutional neural network
CN108932500A (en) * 2018-07-09 2018-12-04 广州智能装备研究院有限公司 A kind of dynamic gesture identification method and system based on deep neural network
CN111695457A (en) * 2020-05-28 2020-09-22 浙江工商大学 Human body posture estimation method based on weak supervision mechanism


Non-Patent Citations (1)

Title
Human pose estimation with a lightweight dual-path convolutional neural network and inter-frame information reasoning; Chen Yukun; Wang Zhengxiang; Yu Lianzhi; 小型微型计算机*** (10); full text *


Similar Documents

Publication Publication Date Title
CN112633220B (en) Human body posture estimation method based on bidirectional serialization modeling
WO2023056889A1 (en) Model training and scene recognition method and apparatus, device, and medium
US11176381B2 (en) Video object segmentation by reference-guided mask propagation
Zheng et al. Cross-domain object detection through coarse-to-fine feature adaptation
Bao et al. Monofenet: Monocular 3d object detection with feature enhancement networks
Yin et al. FD-SSD: An improved SSD object detection algorithm based on feature fusion and dilated convolution
Fu et al. Let there be light: Improved traffic surveillance via detail preserving night-to-day transfer
CN112446342A (en) Key frame recognition model training method, recognition method and device
Cho et al. Semantic segmentation with low light images by modified CycleGAN-based image enhancement
CN112084952B (en) Video point location tracking method based on self-supervision training
CN113191204B (en) Multi-scale blocking pedestrian detection method and system
Wang et al. MCF3D: Multi-stage complementary fusion for multi-sensor 3D object detection
CN112036260A (en) Expression recognition method and system for multi-scale sub-block aggregation in natural environment
CN115376024A (en) Semantic segmentation method for power accessory of power transmission line
CN111626134A (en) Dense crowd counting method, system and terminal based on hidden density distribution
CN116092190A (en) Human body posture estimation method based on self-attention high-resolution network
CN114913342A (en) Motion blurred image line segment detection method and system fusing event and image
Xi et al. Implicit motion-compensated network for unsupervised video object segmentation
CN113269038A (en) Multi-scale-based pedestrian detection method
Zhang et al. Key point localization and recurrent neural network based water meter reading recognition
CN111738092A (en) Method for recovering shielded human body posture sequence based on deep learning
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
Lee et al. Alternative Collaborative Learning for Character Recognition in Low-Resolution Images
CN115953744A (en) Vehicle identification tracking method based on deep learning
Kim et al. Global convolutional neural networks with self-attention for fisheye image rectification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant