CN112131985B - Real-time light human body posture estimation method based on OpenPose improvement - Google Patents

Real-time light human body posture estimation method based on OpenPose improvement

Info

Publication number
CN112131985B
CN112131985B (application CN202010953721.XA)
Authority
CN
China
Prior art keywords
human body
module
body posture
posture estimation
estimation method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010953721.XA
Other languages
Chinese (zh)
Other versions
CN112131985A (en
Inventor
王柳懿 (Wang Liuyi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji Institute Of Artificial Intelligence Suzhou Co ltd
Original Assignee
Tongji Institute Of Artificial Intelligence Suzhou Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji Institute Of Artificial Intelligence Suzhou Co ltd filed Critical Tongji Institute Of Artificial Intelligence Suzhou Co ltd
Priority to CN202010953721.XA priority Critical patent/CN112131985B/en
Publication of CN112131985A publication Critical patent/CN112131985A/en
Application granted granted Critical
Publication of CN112131985B publication Critical patent/CN112131985B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to a real-time lightweight human body posture estimation method based on an improved OpenPose. The method comprises: acquiring a 2D image frame and preprocessing it; inputting the preprocessed frame into a trained feature extraction module and obtaining initial features F with a lightweight MobileNetV2 module; inputting F into an initial module and obtaining a joint-point heat map S1 of the figure pose and a partial affinity map L1 with a partial dual-branch structure; concatenating the obtained F, S1 and L1 and inputting them into a refinement module, which obtains a joint-point heat map S2 and a partial affinity map L2 with consecutive small convolution layers fused with residual structures and a partial dual-branch structure; and connecting the detected joint points and limbs with a greedy parsing method and visualizing the output. The invention reduces the number of parameters and the amount of computation required for human pose estimation, improves execution speed on both CPU and GPU, keeps accuracy at an acceptable level, and facilitates the development and application of human pose estimation on portable and embedded devices.

Description

Real-time light human body posture estimation method based on OpenPose improvement
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a real-time lightweight human body posture estimation method based on OpenPose improvement.
Background
Human body posture estimation aims to predict and connect the positions of the joints of the human skeleton from visual or depth information, and is an important technical means for applications such as human action recognition, action teaching, human-computer interaction, and smart cities. It opens new possibilities for artificial intelligence: by monitoring human behavior, a system can automatically recognize the meaning a target intends to convey and respond accordingly, making the interaction between humans and machines more harmonious and efficient.
Human body posture estimation algorithms can be classified in several ways: by the type of input, into 2D image pose estimation and 3D pose estimation that includes depth information; by the number of subjects, into single-person and multi-person pose estimation; by algorithm model, into top-down and bottom-up methods; and by development stage, into traditional pose estimation and deep-learning-based pose estimation.
The bottom-up approach first detects and classifies all joint points in an image, then matches the joints to the corresponding persons through a connection algorithm; it thus consists mainly of two parts: joint detection and joint clustering. Its advantage is that the recognition speed remains essentially constant regardless of the number of people. The top-down approach first detects the persons in the image and then detects the joints of each person. Its advantage is a high detection rate, but the recognition time grows linearly with the number of detected persons and becomes especially slow when the image contains many people.
Traditional algorithms can be roughly divided into two major types: models based on graph structures and direct regression based on features. The main idea of the graph-structure model is to decompose the appearance of the target into local part templates according to empirical geometric constraints between parts; after each part is parameterized by pixel position and orientation, the resulting structure can model the joints. The main idea of feature-based direct regression is to treat human pose estimation directly as a classification or regression problem; its accuracy is relatively modest, however, and it is suited to scenes with clean backgrounds.
Deep-learning-based human pose estimation mainly uses convolutional neural networks (CNNs) and various CNN-derived networks to extract feature information from images. Compared with traditional methods, a CNN can extract feature vectors of human joint points and their context under various receptive fields and scales, and regress coordinates on these features to obtain the pose estimate of the human body in the image. However, such deep neural networks often involve millions or even billions of parameters and are difficult to run with the same effect on anything other than GPUs with high computing power. The deeper and more complex the architecture, the larger the network tends to be, making it hard to deploy directly on portable mobile devices and embedded devices.
Disclosure of Invention
The invention aims to provide an improved OpenPose-based real-time lightweight human body posture estimation method that significantly compresses the network model in parameter count and computation and significantly improves execution speed, while keeping accuracy at a high level when the overlap between figures in an image is not high.
In order to achieve the above purpose, the invention adopts the following technical scheme:
An OpenPose improvement-based real-time lightweight human body posture estimation method comprises the following steps:
S1, acquiring a 2D image frame and applying standardized preprocessing to the image;
S2, inputting the preprocessed image into a trained feature extraction module and obtaining initial features F with a lightweight MobileNetV2 module;
S3, inputting the initial features F into an initial module and obtaining, with a partial dual-branch structure, a joint-point heat map S1 of the figure pose and a partial affinity map L1;
S4, concatenating the obtained F, S1 and L1, inputting them into a refinement module, and obtaining a joint-point heat map S2 and a partial affinity map L2 with consecutive small convolution layers fused with residual structures and a partial dual-branch structure;
S5, connecting the detected joint points and limbs with a greedy parsing method and visualizing the computed result.
Preferably, in S2: the input image is scaled to 256 pixels, and the scaled image is normalized to satisfy the formula (img - mean) × scale, where img is the RGB value of the original image, mean is taken as 128, and scale is taken as 1/256.
Preferably, in S2: the feature extraction module uses the first 7 bottleneck structures of MobileNetV2 and changes the stride of the 5th and 7th bottleneck structures from s=2 to s=1.
Preferably, in S3 and S4 the initial module and the refinement module output the joint-point heat map and the partial affinity map respectively; except for the last two consecutive 1×1 convolution layers, which are dual-branch outputs, the remaining convolution layers are merged convolutions.
Preferably, in S3: the initial features F are input into three consecutive 3×3 convolution layers to further extract features, and the result is fed into two branches, each performing two layers of 1×1 convolution, to obtain the joint-point heat map S1 and the partial affinity map L1, which satisfy the formulas S1 = ρ1(F) and L1 = φ1(F) respectively.
Preferably, in S4: each consecutive small convolution module fused with a residual structure in the refinement module uses three consecutive 3×3 convolution layers, and the input and output of each small convolution module are added through a parameter-free addition channel; the outputs satisfy the formulas S2 = ρ2(F, S1, L1) and L2 = φ2(F, S1, L1) respectively.
Preferably, in S3: the numbers of output joint points and partial affinity maps are consistent with actual human structure, namely 19 joint points and 38 partial affinities.
Preferably, an encoder-decoder network is used for feature extraction and joint-point detection, ensuring that the visualized output has the same size as the original input image.
Preferably, during network training of the feature extraction module, the accuracy is computed on part of the validation set, and if the current accuracy is the highest so far, the current training model data is saved.
Preferably, after the refinement module, the detected joint points and partial affinities are output to a connection module; the connection module selects the minimal number of limbs in the graph to connect as the trunk of a spanning tree, decomposes the optimal matching problem into a series of bipartite matching sub-problems, determines the matching of each node separately, and obtains the maximum limb matching.
Due to the application of the technical scheme, compared with the prior art, the invention has the following advantages:
The invention adopts a partial network structure modified from MobileNetV2, a partial dual-branch structure, and consecutive small convolution layers fused with residual structures, making the network model lightweight. Using a convolutional neural network and a greedy parsing algorithm, it detects and connects the joint points and limbs of human figures in 2D images, effectively reducing the parameter count and computation of the model, saving memory and computing resources on hardware devices, and maintaining accuracy with good runtime performance.
The invention makes lightweight improvements based on OpenPose, fusing a lightweight network with residual structures, and achieves a significant improvement in execution speed on both CPU and GPU, which makes it friendly to deploying human pose estimation on embedded or portable devices in practice. In particular, when the overlap between human bodies in the image is low and the background is not excessively noisy, it can deliver accurate human-skeleton detection results while remaining lightweight and fast.
Drawings
FIG. 1 is a flow chart of the method of the present embodiment;
FIG. 2 is a schematic illustration of a human body joint and limb connection used in the present embodiment;
FIG. 3 is a schematic diagram of a continuous small convolution layer module with a fused residual structure according to the present embodiment;
fig. 4 is a diagram of the human body posture estimation result according to the present embodiment.
Detailed Description
The following description of the embodiments of the present invention is made clearly and completely with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
The OpenPose improvement-based real-time lightweight human body posture estimation method shown in fig. 1 comprises the following steps:
and 2D image frames are acquired through a video, a picture or a camera and the like, and standardized preprocessing is carried out on the images.
The preprocessed image is input into a trained feature extraction module, and initial features F are obtained with a lightweight MobileNetV2 module. The specific method is as follows:
The input image is scaled to 256 pixels and the scaling ratio is stored so that the image can be restored to its original size at output; the scaled image is then normalized to satisfy the formula (img - mean) × scale, where img is the RGB value of the original image, mean is taken as 128, and scale is taken as 1/256.
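The normalization formula above can be sketched in a few lines of Python (a minimal illustration; the function name and the nested-list image representation are assumptions, not part of the patent):

```python
def preprocess(img, mean=128.0, scale=1.0 / 256.0):
    """Normalize an RGB image as (img - mean) * scale.

    img is an H x W x 3 nested list of 0-255 channel values; with the
    patent's constants (mean=128, scale=1/256), outputs fall roughly
    in [-0.5, 0.5].
    """
    return [[[(c - mean) * scale for c in px] for px in row] for row in img]


# A pixel exactly equal to the mean maps to zero.
assert preprocess([[[128, 128, 128]]]) == [[[0.0, 0.0, 0.0]]]
```

In practice the same arithmetic would be applied with a tensor library; the list comprehension only makes the per-channel formula explicit.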
During network training of the feature extraction module, the accuracy is computed on part of the validation set, and if the current accuracy is the highest so far, the current training model data is saved. The training data uses the COCO 2017 training and validation sets. The number of training epochs is set to 100, and the Adam algorithm is chosen as the optimizer of the loss function. The feature extraction module uses the first 7 bottleneck structures of MobileNetV2, changing the stride of the 5th and 7th bottleneck structures from s=2 to s=1. The detection results are shown in FIG. 4.
The initial features F are input into an initial module, and a joint-point heat map S1 of the figure pose and a partial affinity map L1 are obtained with a partial dual-branch structure. The specific method is as follows:
will initiate the featureFThe features are further extracted by inputting the features into three continuous 3×3 convolution layers, and the features are respectively input into two branches to perform two-layer 1×1 convolution, so as to obtain the joint point heat map respectivelyS 1 And partial affinity graphL 1 Respectively satisfy the formulaS 1 =ρ 1F),L 1 =φ 1F)。
Because the weights and feature maps of the two branches at the same intermediate layer have a certain similarity, merging most of the dual-branch structure into a single branch helps further save space and reduce the resources consumed by computation. At the end of each branch, an L2 loss function computes the difference between the predicted value and the true value. Because some datasets do not annotate every person, a binary mask W(p) is added: when the position of pixel p is unannotated, W(p) is zero, which avoids penalizing originally correct predictions and distorting the training parameters.
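The masked L2 loss described above can be sketched as follows (a hypothetical flat-list interface; the real loss sums over every pixel of every heat map and affinity map):

```python
def masked_l2_loss(pred, target, mask):
    """Sum of W(p) * (pred(p) - target(p))^2 over positions p.

    mask holds the binary weights W(p): 0 where the annotation is
    missing, so unlabeled people contribute nothing to the penalty."""
    return sum(w * (p - t) ** 2 for p, t, w in zip(pred, target, mask))
```

For example, with mask = [1, 0] a wrong prediction at the unannotated second position is ignored: masked_l2_loss([1.0, 5.0], [0.0, 0.0], [1, 0]) evaluates to 1.0, the error at the first position only.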
According to the structure of the human body, as shown in fig. 2, the number of output joint-point heat maps is 19, namely the nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle, right ankle, neck, and background; the number of partial affinity maps is 38.
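These output counts can be captured as a small data structure (the list follows the patent's enumeration; the Python names and the per-limb channel breakdown, standard in OpenPose-style models, are illustrative):

```python
# The 19 heat-map channels named in the patent, in the order listed.
KEYPOINT_MAPS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
    "neck", "background",
]

# In OpenPose-style formulations each limb uses two affinity channels
# (the x and y components of a unit vector field along the limb),
# so 38 partial affinity maps encode 19 limbs.
NUM_PAF_MAPS = 38
NUM_LIMBS = NUM_PAF_MAPS // 2
```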
The obtained F, S1 and L1 are concatenated and input into a refinement module, and a joint-point heat map S2 and a partial affinity map L2 are obtained with consecutive small convolution layers fused with residual structures and a partial dual-branch structure. The specific method is as follows:
The input to the refinement stage is the concatenation of the initial features F, the joint-point heat map S1, and the partial affinity map L1; the outputs satisfy the formulas S2 = ρ2(F, S1, L1) and L2 = φ2(F, S1, L1) respectively.
The receptive field determines how large an area of the original image is visible on each layer's output feature map in a deep convolutional network. To maintain the size of the receptive field while reducing parameters and computation, the original single 7×7 convolution layer is replaced by three consecutive 3×3 convolution layers. Meanwhile, because the network is thereby deepened, a residual structure is introduced to avoid the sharp drop in accuracy that deepening can cause; as shown in fig. 3, five consecutive small convolution modules fused with residual structures are applied in series in the network model.
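The receptive-field equivalence behind this substitution can be checked with simple arithmetic (a standalone helper, not part of the patent):

```python
def receptive_field(kernel_sizes, strides=None):
    """Receptive field of stacked conv layers: each layer widens the
    field by (k - 1) times the current jump, i.e. the product of the
    strides of all earlier layers."""
    strides = strides or [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf


# Three stride-1 3x3 layers see the same 7x7 window as one 7x7 layer,
# but with 3 * (3*3) = 27 weights per channel pair instead of 49.
assert receptive_field([3, 3, 3]) == receptive_field([7]) == 7
```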
After the refinement module, the detected joint points and partial affinities are output to a connection module; the connection module selects the minimal number of limbs in the graph to connect as the trunk of a spanning tree, decomposes the optimal matching problem into a series of bipartite matching sub-problems, determines the matching of each node separately, and searches for the maximum limb matching with the Hungarian algorithm.
The detected joint points and limbs are connected bottom-up with a greedy parsing method, and the computed result is visualized. The principle for connecting limbs is that a joint of one joint type cannot simultaneously connect to two joints of another joint type. Because this bottom-up representation of detected joint points and partial affinities fully encodes global context information, the greedy parsing method greatly shortens the solution time while still achieving high-precision results.
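The greedy connection step can be sketched for a single limb type as follows (simplified: in OpenPose-style parsing each candidate pair is scored by integrating the partial affinity field along the line between the two joints, whereas here the scores are simply given):

```python
def greedy_limb_match(candidates):
    """Greedily connect joints of two types for one limb.

    candidates: (score, joint_a, joint_b) tuples. Pairs are taken in
    descending score order; a pair is accepted only if neither joint is
    already connected, enforcing the rule that a joint cannot link to
    two joints of the other type at once."""
    used_a, used_b, limbs = set(), set(), []
    for score, a, b in sorted(candidates, reverse=True):
        if a not in used_a and b not in used_b:
            limbs.append((a, b))
            used_a.add(a)
            used_b.add(b)
    return limbs
```

For instance, with candidates [(0.9, 0, 0), (0.8, 0, 1), (0.7, 1, 1), (0.5, 1, 0)] the matcher keeps the two highest-scoring non-conflicting pairs, (0, 0) and (1, 1).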
In this embodiment, by combining the three lightweight structures and residual structures and changing part of the OpenPose network structure, tests show that the execution speed of human pose estimation is significantly improved on both CPU and GPU, and the memory and computing resources required by the model are significantly reduced. When lighting is sufficient and the overlap between people is not high, accuracy remains at a high level, giving good runtime performance. This facilitates the development and application of human pose estimation on portable or embedded devices.
The above embodiments are provided to illustrate the technical concept and features of the present invention and are intended to enable those skilled in the art to understand the content of the present invention and implement the same, and are not intended to limit the scope of the present invention. All equivalent changes or modifications made in accordance with the spirit of the present invention should be construed to be included in the scope of the present invention.

Claims (8)

1. A real-time lightweight human body posture estimation method based on OpenPose improvement, characterized by comprising the following steps:
S1, acquiring a 2D image frame and applying standardized preprocessing to the image;
S2, inputting the preprocessed image into a trained feature extraction module and obtaining initial features F with a lightweight MobileNetV2 module;
S3, inputting the initial features F into an initial module and obtaining, with a partial dual-branch structure, a joint-point heat map S1 of the figure pose and a partial affinity map L1;
the initial features F are input into three consecutive 3×3 convolution layers to further extract features, and the result is fed into two branches, each performing two layers of 1×1 convolution, to obtain the joint-point heat map S1 and the partial affinity map L1, which satisfy the formulas S1 = ρ1(F) and L1 = φ1(F) respectively;
S4, concatenating the obtained F, S1 and L1, inputting them into a refinement module, and obtaining a joint-point heat map S2 and a partial affinity map L2 with consecutive small convolution layers fused with residual structures and a partial dual-branch structure;
the initial module and the refinement module output the joint-point heat map and the partial affinity map respectively; except for the last two consecutive 1×1 convolution layers, which are dual-branch outputs, the remaining convolution layers are merged convolutions;
S5, connecting the detected joint points and limbs with a greedy parsing method and visualizing the computed result.
2. The OpenPose improvement-based real-time lightweight human body posture estimation method according to claim 1, characterized in that in S2: the input image is scaled to 256 pixels, and the scaled image is normalized to satisfy the formula (img - mean) × scale, where img is the RGB value of the original image, mean is taken as 128, and scale is taken as 1/256.
3. The OpenPose improvement-based real-time lightweight human body posture estimation method according to claim 1, characterized in that in S2: the feature extraction module uses the first 7 bottleneck structures of MobileNetV2 and changes the stride of the 5th and 7th bottleneck structures from s=2 to s=1.
4. The OpenPose improvement-based real-time lightweight human body posture estimation method according to claim 1, characterized in that in S4: each consecutive small convolution module fused with a residual structure in the refinement module uses three consecutive 3×3 convolution layers, and the input and output of each small convolution module are added through a parameter-free addition channel; the outputs satisfy the formulas S2 = ρ2(F, S1, L1) and L2 = φ2(F, S1, L1) respectively.
5. The OpenPose improvement-based real-time lightweight human body posture estimation method according to claim 1, characterized in that in S3: the numbers of output joint points and partial affinity maps are consistent with actual human structure.
6. The OpenPose improvement-based real-time lightweight human body posture estimation method according to claim 1, characterized in that an encoder-decoder network is used for feature extraction and joint-point detection.
7. The OpenPose improvement-based real-time lightweight human body posture estimation method according to claim 1, characterized in that during network training of the feature extraction module, the accuracy is computed on part of the validation set, and if the current accuracy is the highest so far, the current training model data is saved.
8. The OpenPose improvement-based real-time lightweight human body posture estimation method according to claim 1, characterized in that after the refinement module, the detected joint points and partial affinities are output to a connection module; the connection module selects the minimal number of limbs in the graph to connect as the trunk of a spanning tree, decomposes the optimal matching problem into a series of bipartite matching sub-problems, determines the matching of each node separately, and obtains the maximum limb matching.
CN202010953721.XA 2020-09-11 2020-09-11 Real-time light human body posture estimation method based on OpenPose improvement Active CN112131985B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010953721.XA CN112131985B (en) 2020-09-11 2020-09-11 Real-time light human body posture estimation method based on OpenPose improvement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010953721.XA CN112131985B (en) 2020-09-11 2020-09-11 Real-time light human body posture estimation method based on OpenPose improvement

Publications (2)

Publication Number Publication Date
CN112131985A CN112131985A (en) 2020-12-25
CN112131985B true CN112131985B (en) 2024-01-09

Family

ID=73846193

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010953721.XA Active CN112131985B (en) 2020-09-11 2020-09-11 Real-time light human body posture estimation method based on OpenPose improvement

Country Status (1)

Country Link
CN (1) CN112131985B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113177432B (en) * 2021-03-16 2023-08-29 重庆兆光科技股份有限公司 Head posture estimation method, system, equipment and medium based on multi-scale lightweight network
CN113191242A (en) * 2021-04-25 2021-07-30 西安交通大学 Embedded lightweight driver leg posture estimation method based on OpenPose improvement
CN113221924A (en) * 2021-06-02 2021-08-06 福州大学 Portrait shooting system and method based on OpenPose
CN113368487A (en) * 2021-06-10 2021-09-10 福州大学 OpenPose-based 3D private fitness system and working method thereof
CN113420676B (en) * 2021-06-25 2023-06-02 华侨大学 3D human body posture estimation method of two-way feature interlacing fusion network
CN113255597B (en) * 2021-06-29 2021-09-28 南京视察者智能科技有限公司 Transformer-based behavior analysis method and device and terminal equipment thereof
CN113838131A (en) * 2021-09-17 2021-12-24 佛山科学技术学院 Lightweight two-dimensional human body posture estimation method and device, storage medium and equipment
CN114140828B (en) * 2021-12-06 2024-02-02 西北大学 Real-time lightweight 2D human body posture estimation method
CN114882526A (en) * 2022-04-24 2022-08-09 华南师范大学 Human back acupuncture point identification method, human back acupuncture point identification device and computer storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222665A (en) * 2019-06-14 2019-09-10 电子科技大学 Human motion recognition method in surveillance based on deep learning and pose estimation
EP3547211A1 (en) * 2018-03-30 2019-10-02 Naver Corporation Methods for training a cnn and classifying an action performed by a subject in an inputted video using said cnn
CN110433471A (en) * 2019-08-13 2019-11-12 宋雅伟 Badminton trajectory monitoring and analysis system and method
WO2019222383A1 (en) * 2018-05-15 2019-11-21 Northeastern University Multi-person pose estimation using skeleton prediction
CN110738154A (en) * 2019-10-08 2020-01-31 南京熊猫电子股份有限公司 Pedestrian fall detection method based on human body posture estimation
CN110929584A (en) * 2019-10-28 2020-03-27 九牧厨卫股份有限公司 Network training method, monitoring method, system, storage medium and computer equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10430966B2 (en) * 2017-04-05 2019-10-01 Intel Corporation Estimating multi-person poses using greedy part assignment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A survey of progress in pose estimation for sports video analysis; Zong Libo; Song Yifan; Wang Yiming; Ma Bo; Wang Dongyang; Li Yingjie; Zhang Peng; Journal of Chinese Computer Systems (Issue 08); full text *

Also Published As

Publication number Publication date
CN112131985A (en) 2020-12-25

Similar Documents

Publication Publication Date Title
CN112131985B (en) Real-time light human body posture estimation method based on OpenPose improvement
CN112800903B (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
WO2020078119A1 (en) Method, device and system for simulating user wearing clothing and accessories
CN111383638A (en) Signal processing device, signal processing method and related product
CN108805058B (en) Target object change posture recognition method and device and computer equipment
CN112446302B (en) Human body posture detection method, system, electronic equipment and storage medium
CN112232106B (en) Two-dimensional to three-dimensional human body posture estimation method
CN111709289B (en) Multitask deep learning model for improving human body analysis effect
CN112101262B (en) Multi-feature fusion sign language recognition method and network model
CN111680550B (en) Emotion information identification method and device, storage medium and computer equipment
CN112258555A (en) Real-time attitude estimation motion analysis method, system, computer equipment and storage medium
CN111191630A (en) Performance action identification method suitable for intelligent interactive viewing scene
CN111401116B (en) Bimodal emotion recognition method based on enhanced convolution and space-time LSTM network
Wei et al. Learning to infer semantic parameters for 3D shape editing
CN112836755B (en) Sample image generation method and system based on deep learning
Zhao et al. Human action recognition based on improved fusion attention CNN and RNN
CN117635897A (en) Three-dimensional object posture complement method, device, equipment, storage medium and product
CN112199994A (en) Method and device for detecting interaction between 3D hand and unknown object in RGB video in real time
CN113255514B (en) Behavior identification method based on local scene perception graph convolutional network
CN113192186B (en) 3D human body posture estimation model establishing method based on single-frame image and application thereof
Usman et al. Skeleton-based motion prediction: A survey
CN112634411B (en) Animation generation method, system and readable medium thereof
Wang et al. Design of static human posture recognition algorithm based on CNN
Li et al. Multimodal information-based broad and deep learning model for emotion understanding
Zhong et al. Facial Expression recognition method based on convolution neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant